FOREWORD


In this effort at 
reviewing research related to the education of disadvantaged children and 
youth we sought to examine the problem in the context of historical efforts 
to accelerate the educational development of disadvantaged populations and 
the theoretical context of concern with the mutability of intellectual func- 
tions. The issue plans included two articles concerned with conditions of 
life: rural residence, and health and nutritional status. Unfortunately 
other commitments of one of our authors prevented him from completing 
the article concerning health status. We apologize for this gap, particularly 
since it represents one of the most neglected areas in educational research. 
Four processes thought to influence the development of disadvantaged 
youth are included: socialization, ethnic desegregation, decentralized partici- 
pation and transition to post-secondary education. The formal educational 
process as represented by curriculum modification was not included since 
little substantial progress was made in this area since the 1965 Review. 
Activity on the curriculum front has greatly increased, but results speci- 
fically referrable to the disadvantaged continue to be relatively modest. 


In the development of this issue of the Review of Educational Research 
we have had the active assistance of many people. Most of the work has 
been supported or underwritten by the U.S. Office of Education Educational 
Resources Information Center. The ERIC Clearinghouse on Rural Educa- 
tion and Small Schools at New Mexico State University, directed by Everett 
D. Edington, and the ERIC Information Retrieval Center on the Disadvantaged
at Teachers College, Columbia University, have been deeply 
involved. The Cooperative Community Educational Resources Center of 
Boulder, Colorado also provided initial bibliographic search assistance. 
Appreciation is expressed to the several authors and their consultants. 
The editorial assistance of Lorna Duphiney, Anne Lewis, and Carol Lopate 
is deserving of special note along with the continuing assistance and sup- 
port of the staff at the ERIC IRCD. Special words of thanks go to Adelaide 
Jablonsky, Associate Director of ERIC IRCD until September, 1969, and 
to Gene V. Glass, the General Editor, for their consultative assistance 
throughout the development of this issue. 


Edmund W. Gordon 
Issue Editor 


1: INTRODUCTION 


Edmund W. Gordon 
Columbia University 


The decade of the 1960s 
has been marked by tremendously increased concern for the education of 
underdeveloped segments of our population. The enthusiasm and great 
expectations noted in many of the initial efforts are now balanced by im- 
patience, sobriety, and some degree of pessimism. The problem has not 
subsided in response to a declared intent to attack it. The complexity 
of this problem became more clear when it became evident that simple 
changes in the quality of facilities, increase in the personnel assigned and 
services provided, or modest shifts in curriculum emphases, did not effect 
significant improvements in the quality of learning. Educators are just 
beginning to realize that they confront tremendously complex problems 
when they seek to reverse the negative impacts of educational deprivation, 
social insulation, ethnic discrimination and economic deprivation. Increas- 
ingly, it is sensed that the problem is not simply pedagogical but involves all 
aspects of the community. At the same time that the breadth and complexity
of the problem become clearer, it is also becoming evident that concern
for this problem serves not only political and humanitarian ends. The 
application of pedagogical concern, competence and skill to the improved 
education of the disadvantaged is forcing education to give more serious 
attention to some of the basic problems of teaching and learning. Reduced 
to its essence, the crucial pedagogical problem involved is that of under- 
standing the mechanisms of learning facility and learning dysfunction and 
applying this knowledge to the optimum development of a heterogeneous 
population characterized by differential backgrounds, opportunities, and 
patterns of intellectual and social function. 

Research related to the education of the disadvantaged covered a wide 
variety of approaches and issues. However, most of this work can be 
classified under two broad categories: the first of these might be called the 
study of population characteristics, and the second, the description and 
superficial evaluation of programs and practices. In the first of these 
[...] placed heavy emphasis on the enumeration and 
[...] regarded as deficits in the conditions or behavior 
[...] the groups studied differ from an alleged 
[...] dysfunctions in educational performance among the members of those 
groups. 

Studies within this category can be further divided between research 
into performance and research into life conditions. The largest body of 
research concerns one type of performance characteristic, intellectual status. 
Most studies in this area concentrate on IQ testing results and consistently 
support the hypothesis that high economic, ethnic or social status is asso- 
ciated with high intellectual status; the reverse also shows consistently— 
that is, poor children earn lower scores than rich children; lower class 
children score lower than middle and upper class children; blacks and other 
minority groups do not do as well as whites. Some of these studies included 
some speculative discussion in an attempt to interpret their results. In one 
extreme, investigators see their work as supporting assumptions concerning 
genetic determinants of intelligence; at the other end of the spectrum, 
researchers report their work in a context of support for the prominence 
of environmental factors in determining intellectual status. A few have 
continued the long tradition of alleged association between race, genetics, 
and quality of intellect. However, the largest group now working in the 
field tends to interpret the data as reflective of a complex interaction 
between hereditary and environmental forces in the determination of the 
quality of intellect. 


In contrast to the huge body of research and statistics concerning 
intellectual status, there has been only a limited amount of effort directed 
at the analysis of qualitative aspects of intellectual function in target popu- 
lations. Work in this second area included the efforts of some who tried 
to factor-analyze standardized tests, and many who tried to identify dif- 
ferential deficits and strengths in the functioning of the disadvantaged 
populations. A landmark in this area is the work of Lesser and associates, 
who tried to identify by ethnic groups differential patterns of intellectual 
function. Their work focuses on the identification of qualitative aspects of 
intellectual function, as opposed to the more common concern with the 
identification of intellectual status in quantitative terms. 

A third area encompassing a considerable body of research deals 
with plasticity of intellectual development. Building upon Binet’s early 
concern with the trainability of intellect and Montessori’s efforts at modifying
intellectual function in children of subnormal performance levels, 
investigators worked with mildly or severely retarded children as well as 
with normal and below-average children. These studies led to mixed 
findings, yielding no definitive conclusion about the plasticity of intellect. 
Some reports show intervention to be associated with modest or no sig- 
nificant change in intelligence test scores. Many of these modest changes 
have been interpreted as reflecting a normal fluctuation in intellectual 
function from one test period to another. However, some studies have 
shown significant differences in the level of post treatment test scores, but 
these improvements have not yet been shown in large population studies 
or over a period of time long enough to justify definitive conclusions. In 
these more promising studies some skeptics maintained that the initial 
measurement of intellectual status was wrong, and they elected to use 
the final higher score as the more accurate measure. However, there con- 
tinues in such research the nagging conviction that Binet may have been 
right, that intelligence is a trainable function. The best available data 
lead one to conclude that significant changes in the quality of intellectual 
function are more likely to result to the extent that radical positive changes 
in environmental interactions occur before the consolidation of adolescence. 
There is little research or opinion supportive of the view that intellectual 
function continues to be significantly malleable into the late adolescent and 
adult periods of life. 


A contribution worthy of note in this area is that of Zigler, who 
attempts to account for changes in the quality of intellectual function on 
the basis of changes in the affective state (motivation, task involvement, 
etc.). He uses his data to support the conclusion that affective processes 
are more malleable than cognitive processes, and implies that changes in 
the affective area as well as concerted attention to the conative (skills 
development) area can result in limited shifts in cognitive achievement. 
However, these changes are not viewed as shifts in the character of the 
basic cognitive function, but rather as the products of motivation and skill 
development. Zigler’s view of the stable character of the basic cognitive 
processes reflects the position advanced by Bloom, who argues that the 
processes underlying intellectual functions rapidly lose their plasticity 
following about the third year of life. However, Bloom does not make 
it clear whether he views this stability as resulting from a rapidly reduced 
susceptibility to environmental influences or a heightened dependence 
upon earlier established patterns of stimulus-processing. Zigler’s concern 
with the possible bi-modal distribution of intelligence, with greater representation
of abstract processes of mentation in the higher modal score group 
and higher representation of concrete operations in the lower score group, 
may provide an adequate explanation for the stability both investigators 
refer to. 

In a recent report, Jensen supported this view of the stable nature of 
the cognitive aspect of intellectual function by arguing for a view of in- 
telligence as primarily genetically determined. This retreat to genetics 
to explain the failure of education to significantly modify academic achieve- 
ment in minority and low income groups is considered by many to be 
unfortunate. Not only have educational treatments been insufficiently 
refined or applied but genetic research is certainly not far enough advanced 
to enable us to say to what extent heredity accounts for the quality of 
intellectual function. Our best judgment leads to the conclusion that a 
complex interaction prevails which is as yet not clearly understood. Since 
educators are not able to influence the genetic constitutions of learners, 
our only option is to work harder at making that interaction significant. 


There is an abundance of performance characteristics data relating 
to school achievement and school inefficiency and neglect in disadvantaged 
populations. Whether the populations studied are urban or rural, poor 
blacks, poor whites, poor Puerto Ricans, poor Mexican-Americans or poor 
American Indians, when attrition rate is used as an index to school ineffi- 
ciency and neglect, disadvantaged youngsters rate high. There are no 
characteristics which identify these young people more accurately than 
the low income status of their families, the disorganization of their community
and family life and their low academic achievement. Similarly, 
children from low income and minority groups consistently show lower 
achievement levels. However, at least one significant observation emerges 
from these consistently pessimistic data. The patterns of achievement 
differentials among lower status groups are not independent of the cir- 
cumstances under which they attend school. Data collected through the 
1930s and 1940s showed that blacks educated in the South attained lower 
achievement levels than black children educated in Northern schools. As 
Southern blacks moved north and west, their achievement levels rose. In 
these instances researchers have not been able to determine whether the 
improvement was a result of selective migration, better schools or both. 
In contrast, data compiled during the 1960s tended to show black young- 
sters in urban Southern schools demonstrating higher achievement levels 
than black students in urban Northern schools. Here, again, it is not clear 
whether the difference should be attributed to the pattern of migration, 
perhaps changed in this decade, or to the relative quality of the schools. 
It is also to be noted that during the late 1940s and early 1950s the South 
was pouring money into its black schools in an effort to avoid desegregation, 
while at the same time the quality of education in the Northern urban 
ghetto schools was drastically declining as black enrollment in these schools 
increased to the point that in some instances blacks were in the majority. 

An area of research complementing these examinations of performance 
characteristics among disadvantaged children is the study of their life 
conditions, and how these factors may relate to school performance. In 
terms of school environment, one important condition is the degree of 
separation imposed on the child as a result of his ethnic origin, his socio- 
economic status, or both. The great body of research concerning the relationship
between racial separation and school achievement indicates that low achievement
is associated with the concentration of minority group students in separate school 
situations, with the one possible exception of those schools predominantly 
populated by Oriental children. Similarly, the evidence overwhelmingly 
supports the association between separation by economic group and school 
achievement with low economic status being associated with low school 
achievement. Consistently, poor children attending school in poor neigh- 
borhoods tend to display low-level school achievement. 

Before-and-after studies of desegregated schools have also tended to 
show that achievement levels rise with desegregation, although the exact 
interplay of reactions leading to this result has not yet been conclusively 
defined. For example, the process of desegregation may, by improving 
teacher morale or bringing about other changed conditions, result in an 
over-all increase in the quality of education throughout the system. There 
have been a number of studies examining the possible relationship of 
integration (along racial or status group lines) and achievement, and the 
over-all results of these efforts appear to demonstrate that children from 
lower status groups attending schools where pupils from higher status fami- 
lies are in the majority attain improved achievement levels, with no signifi- 
cant lowering of achievement for the higher status group. However, when 
children from higher status groups are in the minority in the school, there 
tends not to be an improvement in the achievement of the lower status 
group. 
Although these findings are generally supported in mass data compiled 
from large-scale populations, studies of minority group performance under 
experimental conditions of ethnic mix suggest a need for caution in mak- 
ing similar observations for smaller populations and individual cases. From 
these findings it becomes clear that the impact of assigned status and 
perceived conditions of comparison (that is, the subjects’ awareness of the 
norms against which their data will be evaluated) results in a quite varied 
pattern of performance on the part of the lower status group subjects. Thus, 
it may be dangerous to generalize that across-the-board economic and social 
class integration will automatically result in positive improvement for the 
lower status group. 

A neglected area in educational research has been the investigation 
of the relationship between school performance and health status of dis- 
advantaged children. Possibly this is at least in part because it has been 
considered obvious that a child suffering from poor health and inadequate 
nutrition will be adversely affected in his school life. It is clear that the 
lower-status families generally suffer from poorer health and nutrition and 
receive less and lower-quality medical care in less adequate facilities than 
do more advantaged families. The effects of this health disadvantage on 
children of school age are generally thought to be related to impaired 
efficiency in school performance rather than to any actual impairment of 
mental function; although the distribution of health care is relatively poor 
in this country, and more so for lower-status groups, poor health care and 
impaired nutrition are probably tolerable with respect to actual brain damage.
These conditions are probably not sufficiently severe to result in organic 
changes in the nervous system. 


It is in the area of the health status of the pregnant mother that re- 
searchers have found the possibility of a significant relationship between 
health and nutritional status and the intellectual function of the developing
child. It has been reported that children of women who become 
pregnant in the late spring have a higher incidence of mental retardation, 
a possible result of poor diet during the first trimester which for them 
occurs in the summer. One could infer that women with chronically inadequate
diets have a higher incidence of mentally retarded children. 


It has also been shown that pregnant mothers from lower socio-eco- 
nomic groups receive poor prenatal care compared to other mothers. Many 
of these women do not see a physician until the eighth or ninth month of 
pregnancy, and even then may consult a general practitioner rather than a 
specialist. Some see the specialist for the first time at delivery and still 
others are never seen by a physician. There is a significantly higher inci- 
dence of premature births among lower socio-economic groups, with a re- 
sulting risk of congenital abnormality and abnormal postnatal development. 
Poor nutrition again may be a factor in this context, for women from lower 
economic groups tend to be smaller and shorter, thus running a greater risk 
of giving birth prematurely. Once born, normal or otherwise, these children 
receive little, inadequate or no postnatal medical care. Aside from possible 
defects in the organism which may result from inadequate nutrition and 
prematurity, there is evidence of a relationship between height and weight 
and school achievement, with underdeveloped children performing less well. 
It may be inferred that a disturbance in the growth pattern tends to relate 
to a disturbance in the achievement pattern, but it is not clear whether the 
effect is due to mental impairment or to lowered energy levels. A third 
possibility is that the abnormal growth pattern may be the result of some 
chronic illness which, though not interfering with mental function, may 
affect the child’s exposure to learning opportunities. At present, such rela- 
tionships between low health status and poor academic achievement are 
unclear, and the precise nature of cause and effect in different combinations 
of circumstances remains to be determined. It is clear, however, that there 
exist highly suspicious relationships between poor health, poor nutrition, 
questionable pregnancy and birth history and low school achievement. 


A more heavily documented area is the family life of groups considered 
disadvantaged. There have been three major trends in this area. A great 
deal of effort has been spent in collecting demographic data of various 
kinds. Others have attempted in various ways to document coping patterns 
and life styles in disadvantaged families. Still other studies have concentrated
on patterns of child rearing with attention to such aspects as language
modeling, task orientation, cognitive stimulation, and value orientation.

Demographic studies have fallen into several categories, with one 
type concentrating on the economic, employment, and educational levels of 
the family; other investigations were examinations of various aspects of 
family disorganization such as consensual marriage, out-of-wedlock 
children, divorce rates, broken homes, etc. A third type of demographic 
study was the examination of the matriarchal or female-dominated house- 
hold. The most important summary of this type of material was provided 
by the Moynihan monograph on the Negro family, which concentrated 
on the problems and deficiencies of this segment of the population in an 
effort to build up a case for emergency federal government intervention. 
This report concludes that improvement of the role of the Negro male and 
enhancement of the integrity of the Negro family are central and crucial 
in efforts toward the optimal development of the mass of Negro people 
in this country. 


The work of Moynihan came under sharp criticism because of its 
heavy emphasis on negative characteristics and because of its implied asso- 
ciation between ethnic characteristics of a population and the negative 
family characteristics described. Critics stressed the relationship between 
low economic status and negative characteristics as an equally viable ex- 
planation, suggesting that when economic status is controlled for, the 
quality of life in poor black families is not appreciably different from that 
in poor white families save for the fact of racial discrimination. Critics also 
felt that Moynihan’s report had not made sufficient use of the work of 
investigators such as Hylan Lewis, who reports a wide variety of coping 
skills in black low income families and who calls attention to the contribu- 
tion of the extended family and the impact of welfare laws and regulations 
on the incidence of reported paternal absence from the home. 


In the controversy over the status of research into the black family, 
the politics of the current period has seriously interfered with the objective
examination of material. Justifiable as the criticism of Moynihan on 
political grounds may be, the major thrusts of his research review are 
supported by earlier works, such as those by E. Franklin Frazier, the late 
noted black sociologist. The problem here may reflect the difference be- 
tween the objective collection of data and the interpretation of those data 
in the context of dynamic social and political climates. 

The work of Oscar Lewis represents an effort at understanding the 
dynamics of the life of Mexican-American and Puerto Rican families. 
Lewis’s works seem to be directed at both specialized and popular audi- 
ences. The dual focus of the works results in a conflict between reporting for 
a popular audience and serious investigation of social science problems. 
This is particularly obvious in the Puerto Rican study in which a family 
with serious problems is studied. At its worst, such work is thought to 
demean a whole population by placing emphasis on what may be col- 
loquially regarded as negative or sordid aspects of the family life examined. 
Such work reported in a period of descending emphasis on civil rights 
would very likely have been less negatively received; but reported in a period 
of ascending concern for civil rights, the value of the research is considerably
reduced by the inflammatory nature of the work’s almost exclusive 
emphasis on the deficiencies and negatively accorded atypicalities of those 
families studied. 


In efforts to understand child rearing patterns and their impact on 
developmental patterns of disadvantaged children, there has been an em- 
phasis on the difference between patterns in these families and those of 
more privileged groups. However, in this research there appears to be 
greater attention given to trying to understand the relationship between 
patterns identified and the process of development, so that the quality and 
style of intellectual function in the mother are associated with work habits 
and skill achievement of the youngsters. Familial styles in decision-making 
are associated with language usage in the child. Attitudes toward school 
and support for learning activities are associated with school achievement. 
In these research efforts, the attempt is not limited to identification of 
atypical patterns and generalized speculation about possible impacts, but 
is directed at the demonstration of relationships between these patterns 
and developmental patterns in the behavior of the youngster. 


In contrast to the varied, detailed, and sometimes adequately designed 
research into population characteristics among disadvantaged groups, the 
examination of educational programs and practices for children of such 
populations has too often been characterized by only superficial descrip- 
tion and evaluation. One would expect that the substantial research focused 
on population characteristics would be reflected in new treatments; how- 
ever, there was relatively little effort at relating development and treatment 
efforts to the population characteristics studied. Rather, treatments tended 
to emerge from the special biases or dominant models in the field, with 
either the fact of intervention or the magnitude of intervention receiving 
more attention than the specific nature or quality of intervention. This 
tendency may account for the fact that much of the research referrable 
to treatment and programs is characterized by superficial description of 
program or practice and general evaluation of impact. 


This type of research is most heavily represented in the literature by 
reports which identify the major program elements or the central features 
of the practice. It covers a wide range of levels from pre-school through 
adult and community focused activities. These program descriptions reflect 
a period of preoccupation with the mounting of demonstration projects, as 
opposed to a concern for controlled experimentation. With considerable 
support from private foundations and the federal government, work in the 
first half of this decade consisted primarily of demonstration projects in 
major concentrations of the target population. The research contribution 
of many of these projects was the development of models which have greatly 
influenced the field, often without evaluation or serious examination. 
Several of these projects moved from demonstration to model program and 
technique dissemination. 

This concern with demonstration and model programs eventually led 
to an increased interest in evaluation, but attempts in this area tended 
to be directed at answering the question “Did the treatment work?” The 
data sought in answer to this question were excessively dependent upon 
subjective impressions of either the people providing the service or those 
receiving the treatment. One aspect of the classical experimental model 
was frequently used, resulting in the comparison of treated groups with 
untreated control or comparison groups. 


Despite the large number of programs and efforts (and the vast amounts 
of money spent for some), significant positive findings have been rare, 
leading many investigators to conclude that compensatory education has 
failed. In the cases where positive findings are reported, it is difficult, if 
not impossible, to identify or separate treatment effects responsible for the 
result from general Hawthorne effects (brought about by possible impact 
of a changed situation) or from Rosenthal effects (the result of the impact 
of changed expectations). 

These demonstration and evaluation efforts reflect a search for generic 
treatments, a desire to find the program or practice that works for large 
numbers of people; this tendency can be seen as a reflection of the generic 
nature of research on population characteristics, which tends to give the 
impression that we are dealing with a large, homogeneous group with 
common problems of development. Very little of this research is directed 
at carefully designed and controlled experimentation or at qualitative 
analysis of large samples of naturally occurring programs to identify 
relationships between differential learner characteristics and differential 
treatment characteristics. Questions as to what works for which children under 
what specific conditions are not heavily reflected in available research to 
date. 

One specific condition which held the attention of educators for 
years is the existence or absence of racial or class separation in the school 
situation. This concern led to a heavy representation in research literature 
of studies not only concerned with the effects of racial segregation or 
desegregation, but also examining more broadly equality of access and 
opportunity to a variety of educational programs. Studies in this area fall 
into several categories of subject matter and method. Some attempt to show 
differential status of academic achievement in schools predominantly serv- 
ing one or another ethnic group. Other efforts concentrate on the status 
of resources available in the school. Some studies examined the characteris- 
tics of staffs in such schools. Another type of study is directed toward actual 
or projected performance of pupils under different patterns of opportunity. 
Some research was concerned with socio-political processes influencing 
access to equal educational opportunities, with the most important work in 
this area being the Coleman report, Equality of Educational Opportunity, 
and the Civil Rights Commission report, Racial Isolation in the Schools. 


Findings in this area are not entirely clear, although Coleman found 
relatively little difference in achievement levels which could be attributed 
to the differences in the quality of school facilities. On the whole, he re- 
ported little difference in the quality of facilities for rich and poor, black 
and white, although several prior studies lead to the conclusion that 
quality of educational facilities and opportunity is related to family income 
status. The possible conflicts in these bodies of data may be related to 
temporal factors. Many of the prior studies were conducted at a time when 
there was a greater difference between schools, given the very dynamic 
situation with respect to the quality of schools peculiar to the period from 
1950 to 1965. It is also possible that Coleman’s heavy emphasis on static 
as opposed to more dynamic qualitative aspects such as teacher-learner 
relationships is also a cause of that conflict. 


Evidence from the Coleman Report and a number of other sources 
supports the conclusion that home conditions, general conditions of life, 
are more important predictors of school achievement than any of the 
variables that were studied. Although it is probably true that for an indi- 
vidual child, good schooling can possibly overcome many of the limitations 
of his background, for the population at large this relationship does not 
seem to exist. Of greater importance than the quality of the school, and 
second only to family background as a factor contributing to the quality 
of achievement, is a variable that may best be described as the sense of 
environmental control. Where pupils are high in this variable, school 
achievement tends to be high, irrespective of ethnic or economic level. What 
is not clear from this research, however, is the relationship between this 
environmental control variable and the actual home and school life con- 
ditions of the pupils. That is, it is not yet clear whether access to a good 
school and the opportunity to live in a “good community” are themselves 
associated with a high sense of environmental control. 


Although data on access to and opportunity in elementary and secondary 
school are equivocal, there is no confusion in available research 
with respect to higher education. The data consistently show that the 
rate of post-high school participation in institutional learning is disastrously 
lower for minority or lower status groups than for the more privileged. 
Similarly, the data also show that the quality of institutions generally 
available to and utilized by a minority group youngster is vastly inferior to that 
available to more privileged segments of the population. It is to be noted, 
however, that the literature reflects very rapid and dramatic changes in 
this area; but that rate of change is terribly modest in relation to the magni- 
tude of the need. 


Looking at the considerable body of research represented in this issue of 
Review, the previous issue devoted to the disadvantaged in March 1966, 
and the growing list of citations included in the ERIC system, one is inclined 
to be considerably more sanguine than is justified. This work does indicate 
that a number of serious and good workers are investing time and effort 
on one of the most important problems in education. However, a serious 
examination of these works reveals several conceptual and methodological 
problems: 


1. With rare exceptions the available research relating to the disad- 
vantaged treats the target population as if it were a homogeneous 
group despite the mounting evidence that heterogeneity within the 
several subgroups so designated may be a more crucial problem in 
educational planning. 

2. Similarly there appears to be a search for generic treatments or the 
one solution to the neglect of multiple solutions, individualization 
or the matching of treatment to specific characteristics. 


3. Studies in this area tend to depend excessively on quantitative 
measures and static variables to the neglect of the process variables 
and the qualitative analysis of the behaviors, circumstances and 
conditions studied. 

4. Too much research is directed at relationships between single 
variables despite increasing awareness that there are few if any 
phenomena which can be adequately explained on the basis of the 
interaction between only two variables. Too little attention is given 
to the examination of multiple interactions and multiple relation- 
ships in the genesis of behavior or behavior change. 


5. Particularly in the study of disadvantaged populations, investi- 
gators suffer from the tendency to view characteristics which differ 
from some presumed norm as negative and consider any correla- 
tion between these negative characteristics and learning dysfunction 
as culpable. This leads to a view of difference as something to 
overcome rather than a phenomenon with which to work. 

6. In evaluating a treatment, the tendency is to infer from negative 
results that the treatment can not work when in fact there fre- 
quently was no effort to determine whether the treatment as pre- 
scribed was actually applied. The quality of many large scale 
programs is indeed questionable. 


7. In their effort to be rigorous in the use of control groups, investi- 
gators fail to see the extent to which adequate control is increasingly 
difficult given the complexity of our society or the extent to which 
control and experimental groups interact in incidental and indirect 
contacts. The result may be that the experimental group is acting 
directly or indirectly on the control group. 


8. Educational research has been dominated by a concern with hypo- 
thesis testing or verification to the neglect of investigation based 
on careful and systematic observation; this rather than theory test- 
ing is the immediate goal. In a field where the available leads are so 
few, the discovery of theory may be a more productive research 
strategy in efforts at better understanding and controlling the 
mechanisms of learning facility and learning dysfunction in 
heterogeneous populations characterized by differential backgrounds, 
opportunities and patterns of intellectual and social function. 


2: IMMIGRANTS AND THE SCHOOLS* 
David K. Cohen 


Harvard University 


It is hard to imagine anything more characteristically 
American than faith in the efficacy of schooling. Particularly since the late 
nineteenth century, public education has been viewed as an antidote for 
the diminishing equality of opportunity generally thought to be associated 
with cities, industry, immigration, and hardening class structure. 

This view of schooling is based on the idea that in advanced industrial 
societies occupational success depends upon intellectual competence. Al- 
though Americans are accustomed to the way that notion was expressed in 
Brown v. Board of Education and the Sputnik debates, Ellwood Cubberley 
(1909, pp. 18, 19) put it just as aptly when he wrote of industrialism: 


Along with these changes there has come not only a tremendous 
increase in the quantity of our knowledge, but also a demand 
for a large increase in the amount of knowledge necessary to 
enable one to meet the changed conditions of our modern life. 
The kind of knowledge needed, too, has fundamentally changed. 
The ability to read and write and cipher no longer distinguishes 
the educated from the uneducated man. A man must have better, 
broader, and a different kind of knowledge than did his parents 
if he is to succeed under modern conditions. 


The old idea that knowledge is power is extended to imply that it is the 
key to individual social and economic status. It is only a step from this 
to the view that schooling is worth money; although the identification of 
knowledge with technical and economic progress was at least as old as 
Condorcet, only at the turn of the twentieth century was this association 
given a peculiarly American turn in studies of income returns to schooling. 
The other side of this is the argument that schooling can prevent social 
problems. It was put succinctly in 1917, by P. P. Claxton, then U. S. Commissioner 
of Education (Ellis, 1917, p. 3): 


Comparatively few are aware of the close relationship between 
education and the production of wealth, and probably fewer still 
understand fully the extent to which the wealth and the wealth-producing 
power of any people depend on the quality and quantity 
of education. . . . Poverty is not to be pleaded as a reason for withholding 
the means of education, but rather as a reason for supplying 
it in larger proportion. (emphasis added) 


*Support for the preparation of this manuscript was supplied in part by U.S.O.E. 
contract #6-10-240 with Yeshiva University. 


The thinking behind this involved a few key ideas. One was that 
cities typically attract domestic or foreign peasant immigrants; education 
could prevent their being constrained, lumpen, at the bottom of the heap 
by offering paths to occupational attainment based on merit. Allowing 
the able to work their way up might reduce social tension and avoid class 
warfare.* A second was that providing minimal training to the poor would 
encourage punctuality, cleanliness, and respect, which would reduce crime 
and disorder; this would improve the quality of life for the laboring class 
and the quality of labor for the owning class. Finally, urban immigrants 
generally stood outside the mainstream of the national political culture; if 
the schools could teach them the language and the main features of the 
political system, the newcomers might then be expected to assume the re- 
sponsibilities of citizenship. 


These ideas form a rough general system, a popular ideology of social 
reform, which has increasingly dominated educational thought and practice 
since the turn of the century. Since it holds that schooling is the best 
remedy for inequalities of social and economic opportunity, the ideology 
assumes that adult social and economic status is determined on the basis 
of standards similar to those used to evaluate school performance: intelligence, 
order, discipline, and respect for authority. The ideology also implies 
that the desideratum of social reform is not the aggregate redistribution of 
social and economic status, but the maintenance of merit standards on the 
basis of which qualified individuals can effect a personal redistribution. 


*The argument that the public school system should (or did) work on the basis of 
merit to promote occupational mobility can be found in widely disparate places. It is 
no surprise to find liberal school reformers making the argument, but E. L. Thorndike 
(1903, pp. 44-46), who thought environment had a trivial impact upon intelligence, 
[remainder of sentence illegible in the source]. The general argument is nicely 
illustrated by the following excerpt from one of Frank Carleton’s essays 
(1907, pp. 71-79): 


The rapid growth of our cities has been a marked feature of our growth and 
development. The race must adapt itself to urban conditions. If the United 
States is to continue on its present course of advancement and progress the 
city must be made clean, healthy, moral, and it must be well governed. 


The great problems connected with the city . . . are at the root questions 
of education. The school must broaden the civic and social life of the entire 
community. It must supply, or attempt to supply, those elements which also 
develop the new elements which our present civic, social, and industrial 
conditions necessitate. . . . 


If children are found in our crowded schoolrooms who are not readily 
amenable to the discipline there in force, it should be clear that the proper 
kind of training is not given them. Children from all kinds of homes and 
home environments should not be treated exactly alike, if good results are to 
follow our efforts. Financially—let the taxpayer take notice—it is more desirable 
to treat the case now than later. . . . These are not bad children; they 
are rather “morally sick.” Improper training and environment have made them 
what they are today. 


Finally, a view of history is involved. Schooling, it is assumed, 
“worked” for immigrants who arrived from Europe around the turn of the 
century, but has not had the same effect for Negroes (Weinberg, 1969, p. 
28). The reasons advanced to support this account of events vary con- 
siderably. Some suggest that it was due to the fact that individual city 
schools were then politically and culturally more identified by (controlled 
by?) the immigrant groups they served, and some maintain that the quality 
of teachers’ commitment in the cities then was greater (Calhoun, 1969, p. 
73). Still others argue that immigrants did not meet the racial bigotry which 
Negro children face in city schools today (Weinberg, 1969, p. 28). But 
whatever the reasons, it is widely believed that although public education 
provided the means by which southern and eastern Europeans moved into 
the social, cultural, and political mainstream, it is not currently performing 
the same service for Negroes. 

An alternative view is that in many fundamental respects city schools 
related to working-class immigrants in much the same way as they now relate to blacks. On 
this presumption, the educational problems which have erupted in city 
schools in the last decade—the cultural and linguistic content of curricula, 
the school staffs’ ethnic and racial composition, the effectiveness of educa- 
tion for poor children—are recent variants of an old problem: the inability 
of public education to overcome the educational consequences of family 
poverty, and to recognize the legitimacy of working class and ethnic 
cultures. 

The relative merits of these two interpretations of the educational 
experience of immigrant children can not be decided here. It is possible, 
however, to explore one particularly salient aspect of the issue—school 
performance. Did immigrant children do as well in school as native urban 
whites? Did they progress as far, achieve as well, and graduate as fre- 
quently? If so, one would be inclined to discount the idea that immigrants 
and Negroes fared similarly in city schools. If not, there would be a basis 
for further exploration of the similarities. 


The earliest direct evidence on these questions arises from surveys of 
school retardation carried out at the turn of the century. The appearance 
of these studies coincides with the entrance of large numbers of immigrant 
children in city schools. Quite a few efforts were made in the first decade 
of the century (Corman, 1908; Falkner, 1908; Greenwood, 1908), but the 
first large-scale survey involving immigrants was published by Leonard 
Ayres (1909). It covered more than fifty American city school systems, in 
an effort to determine the extent of retardation. 

Ayres found enormous variation among city school systems. Only 
18% of all the students in Boston’s public schools, but nearly 60% of those 
in Cincinnati’s, were retarded; the average seems to have been around 30% 
(Ayres, 1909, p. 45, Table 22). But such comparisons are valid only if 
the underlying phenomenon is the same in all cases. If cities followed 
dissimilar promotion practices, the variation among retardation rates would 
reflect these disparate practices and the results would not be comparable. 

This problem seems to have escaped Ayres, for he presented no evi- 
dence on it; he did, however, conduct a depth study in New York City, and 
it seems reasonable to presume more uniform promotion policies in one 
city. He collected the records of 20,000 children from 15 public ele- 
mentary schools. The analysis revealed that slightly more than 23% of all 
students were at least a year behind the expected grade for their age (Ayres, 
1909, p. 107). 

Ayres also pursued the relationship between nationality and retardation 
in this depth study. The results of computing retardation rates by national 
origin are displayed in Table 1 (Ayres, 1909, p. 45, Table 22). It reveals 
that retardation was twice as great for Irish and Italian children as for 
students of native or mixed parentage, but it shows even greater variation 
among immigrant groups themselves: children of German parents were 
less often retarded than any other group, including native white Americans. 
Although it is easy to imagine reasons for such variations (ethnic differences 
in duration of stay and language acquisition, or social class, or both), 
Ayres’s data were not amenable to exploring these questions. The evidence 
tells us only that retardation was severe for some immigrant groups, mild 
for others, and on the whole somewhat higher for immigrants than for 
native Americans. 


TABLE 1 

GRADE RETARDATION IN FIFTEEN NEW YORK CITY 
ELEMENTARY SCHOOLS, BY NATIONALITY, 1908* 

                 Per Cent Of Students 
                 Retarded At Least 
Nationality      One Grade 
German           16 
American         19 
Russian          23 
English          24 
Irish            29 
Italian          36 


*Ayres reports that he also tabulated the results separately for each school, 
to determine whether local school conditions (such as type of [illegible] 
policies, or predominant nationality, for example) produced variations in the 
distribution of retardation rates. He did not report the results, however. 


Another study of retardation in New York City secondary schools 
was carried out at about the same time by J. K. Van Denburgh (1911), a 
member of the Teachers College faculty. Although the extremely selective 
character of secondary education at the turn of the century means that 
the results must be approached with caution, the rank order of nationalities 
for high school completion is roughly the same as that for elementary school 
children. Van Denburgh’s data (1911, p. 45) permit computation of reten- 
tion rates (percent of those entering high school who graduated) for several 
nationality groups. The retention rates were 1% and 0% for Irish and 
Italian children, respectively, 10% for native whites, 10.8% for those from 
Britain, 15% for those from Germany, and 16% for Russian children 
(whom Van Denburgh and Ayres thought were almost exclusively Jewish). 

These issues were explored in a larger and more precise study of schools 
and nationality carried out in 1908-09, and published by the U. S. Immi- 
gration Commission (1911). The survey covered all schools, students, and 
teachers in thirty cities (twenty of which were the country’s largest in point 
of population), and produced information on 2,036,376 pupils. Its estimate 
of retardation was based on a less liberal definition of age-grade norms 
than Ayres’s, and thus the results suggested a greater average retardation 
rate (36%) than given in the New York depth study (U. S. Immigration 
Commission, 1911, p. 31).* Retardation for native American white children 
was 28% as against 40% for children of foreign-born parents. This con- 
siderable difference was accentuated when language variations were taken 
into account; children of immigrant parents from English-speaking coun- 
tries were no more often retarded (27%) than children of native white 
parents, but more than 43% of immigrant children from non-English- 
speaking countries were retarded (U. S. Immigration Commission, Vol. I, 
p. 31). 

*Ayres’s (1909, p. 107) definition of retardation was as follows: “All children up to 
the age of nine years were considered as of normal age for the first grade. Ten was 
the limit in the second grade, eleven in the third, and so on.” The Immigration 
Commission (1911, Vol. I, p. 31), on the other hand, defined a retarded pupil as one 
“who is 2 or more years older than the normal age for his grade. Thus, a pupil 
is retarded if 8 years or over in the first grade; 9 years or over in the second grade...” 
By Immigration Commission standards, then, Ayres’s result substantially understated 
retardation rates. 


Since the Immigration Commission used a uniform measure of retardation 
in all the cities studied, these results seem reasonably solid. 
Retardation in city schools was nearly twice as severe for those immigrant 
children whose parents arrived from non-English-speaking countries as 
it was for native urban whites. The evidence for immigrant children of the 
first generation, at least, is that they had no easy time in city schools. 
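

The two age-grade definitions quoted from Ayres and from the Immigration Commission can be set side by side in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, ages are assumed to be whole years, Ayres's "up to the age of nine" is assumed to include nine, and both rules are assumed to advance by one year per grade, as the quoted passages suggest.

```python
def retarded_ayres(age: int, grade: int) -> bool:
    """Ayres (1909): nine is the upper limit of normal age for grade 1,
    ten for grade 2, and so on; a pupil older than the limit is retarded.
    (Assumes the limit itself counts as normal age.)"""
    return age > grade + 8


def retarded_commission(age: int, grade: int) -> bool:
    """Immigration Commission (1911): retarded if 8 years or over in
    grade 1, 9 years or over in grade 2, and so on."""
    return age >= grade + 7


# Ages at which the two rules classify a first-grade pupil differently.
disagreements = [
    age for age in range(5, 13)
    if retarded_ayres(age, 1) != retarded_commission(age, 1)
]
print(disagreements)
```

Under these assumptions the Commission's rule counts some pupils as retarded whom Ayres's rule treats as of normal age, which is why the Commission's criterion yields the higher rate.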


The difference in retardation rates between children from English 
and non-English-speaking countries suggests that variations in exposure 
to the language and culture may have affected retardation. This idea did 
not escape the Immigration Commission’s research workers; although they 
did not carry out a longitudinal study of children, they did tabulate re- 
tardation by several variables which measured exposure. Father’s citizenship 
status, child’s place of birth, child’s age upon arrival in the U.S., and 
language spoken in the home all revealed an inverse relationship between 
retardation and exposure.* The most dramatic comparisons arose from a 
variable which seems to have measured the length of time which the 
child had been exposed to the American language and culture; the results 
for several immigrant groups are displayed in Table 2 (U. S. Immigration 
Commission, Vol. I, p. 32). 


TABLE 2 

RETARDATION IN SCHOOL AND BIRTHPLACE OF STUDENT 

                      Per Cent Retarded 
                 Born in City     Born 
Nationality      Surveyed         Abroad 
Native white     [illegible] 
English          24.3             29.9 
German           31.3             51.0 
Russian Jews     29.6             59.9 
Italian          57.0             76.7 
Irish            27.6             54.8 


*All of the following data from the Commission’s report were based on an intensive 
sub-study of 62,321 pupils in twelve cities, for whom more detailed information was 
gathered (U. S. Immigration Commission, 1911, pp. 27, 172-77). 


In a sense, the most interesting aspect of the table is not that exposure 
affected retardation, but that it seemed to have had differential effects 
among the ethnic groups. The rate for Russian Jews and the Irish was cut 
almost precisely in half (down to the average for native urban whites) by 
controlling exposure, while for Italians the rate fell by slightly less than 
one third. It also appears that the exposure variables measured more than 
linguistic skills: Irish children born abroad, but in an English-speaking 
country, were twice as likely to be retarded as Irish children born in the 
United States. It is easy to imagine that this could result from cultural 
variations, or differences in acculturation, but the Immigration Commission, 
as well as other observers, pointed to social and economic class differences 
between earlier and later immigrants. Those arriving around the 
turn of the century, whose children were less likely to have been born in 
the United States, were generally believed to have been poorer and less 
well educated (Hutchinson, 1956, pp. 64-66). 
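

The differential effect of birthplace discussed for Table 2 can be expressed as the proportional drop in each group's retardation rate between children born abroad and children born in the city surveyed. The short Python sketch below is illustrative only: the dictionary and function names are hypothetical, and the native white row is left out because it has no born-abroad figure to compare.

```python
# Per cent retarded, from Table 2: (born in city surveyed, born abroad).
table_2 = {
    "English": (24.3, 29.9),
    "German": (31.3, 51.0),
    "Russian Jews": (29.6, 59.9),
    "Italian": (57.0, 76.7),
    "Irish": (27.6, 54.8),
}


def proportional_drop(born_here: float, born_abroad: float) -> float:
    """Fraction by which the rate is lower for children born in the city."""
    return (born_abroad - born_here) / born_abroad


drops = {group: proportional_drop(here, abroad)
         for group, (here, abroad) in table_2.items()}

for group, drop in drops.items():
    print(f"{group:13s} {drop:.1%}")
```

On the Table 2 figures the Russian Jewish and Irish rates fall by roughly one half, while the Italian rate falls by a smaller fraction.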


Did rates of retardation for immigrant children decline as the century 
wore on? Several smaller studies offer evidence on this point. One was 
undertaken in New York City in the early 1930's by J. B. Maller, a member 
of the Teachers College faculty. He surveyed all the city’s elementary 
schools (Maller, 1931), and computed school progress rates for several 
nationality groups. Although his ethnic groups do not always correspond 
to those in the Ayres study, the available comparisons are worth consider- 
ing. Where Ayres found an over-all retardation rate of about 23%, 
Maller found one of 29%; where Ayres found a retardation rate of about 
35% for Italians, Maller found 35%; where Ayres found a retardation rate 
of about 23% for Russians (Russian Jews, apparently), Maller reported a 
retardation rate for Jews of 25%.* Although it would be unwise to attach 
much importance to any one of these numbers, the over-all similarity in 
their order of magnitude is striking. In New York City, at least, there may 
have been little change in retardation rates for foreign-born children 
between 1900 and 1930. 


*Maller did not specify the criterion of school progress, save that pupils making slow 
progress were those the New York schools classified as “retarded.” This seems to 
include all those a grade or more behind expectation for their age. 


Another effort was carried out in the public elementary schools in 
Minneapolis and St. Paul (Jordan, 1919). Some of Jordan’s results are 
presented in Table 3 (1919, pp. 36-7). The first column contains results 
only for children of unmixed parentage; it reveals substantially greater 
retardation for immigrants than native whites. The Immigration Commission’s 
results (1911, Vol. 4, p. 105) for Minneapolis (dating from 1908) are 
displayed in the second column. They are little different from the rates for 
the same ethnic groups eleven years later. 


The third column of the table is interesting for other reasons. It displays 
the rates of retardation for children of “mixed” parentage; such 
children, of course, were less likely to be insulated in ethnic subcultures. 
The percentages reflect this, for the retardation rates for “impure” nationality 
were much lower than those for “pure” nationality. This suggests, 
as did the Commission’s comparisons of children born here and abroad, that 
assimilation had something to do with school progress. The continued 
severity of retardation for Italians and Poles is a case in point; it may be 
explained partly by the fact that they were less likely to assimilate than 
other ethnic groups. The table indicates, for example, that there were insufficient 
cases of mixed marriages in these two groups to compute retardation 
rates. 


TABLE 3 

RETARDATION RATES FOR ELEMENTARY SCHOOL PUPILS, 
MINNEAPOLIS AND ST. PAUL, 1908 & 1919 

                                Percent Retarded 
                 “Pure”          Immigration       “Impure” 
                 Nationality*    Commission        Nationality 
Nationality      (1919)          (1908)            (1919) 
Native White     32.0            37.2              [illegible] 
German           42.9            46.2              31.2 
Swedish          41.2            41.4              28.0 
Russian Jews     45.5            39.4              18.7 
Polish           [illegible] 
Italian          61.1 


*The “Pure” nationality designations refer to children whose parents and grandparents 
were from that nationality group, and the “Impure” designation to children whose 
parents and grandparents had intermarried with Americans, or members of other ethnic 
groups. 


Research on the intelligence of immigrant children (most of which was 
undertaken in the decade following World War I) provides further indirect 
evidence on school retardation among immigrants. The studies involved 
a wide variety of tests and elementary schoolchildren of several ages and 
places; yet the rank order of nationalities did not vary. One study of New 
York City schoolchildren, for example, yielded the median IQ scores displayed 
in Table 4 (Murdoch, 1920).** Another study of California fifth 
graders of about the same age range (Young, 1922) showed that the Median 
IQ for Native Whites was 110, and 85 for children of Italian-born parents. 


TABLE 4 

MEDIAN IQ SCORES FOR NEW YORK CITY 
TEN-YEAR-OLDS, 1919 

Nationality      Median IQ     N 
Native White     108.5         48 
Italian          84.3          28 


**Only the IQ scores for the youngest children are presented, to minimize selection 
effects. 

Perusal of these studies reveals that the debate over immigrant intelli- 
gence parallels quite precisely the debate over Negro intelligence. One of the 
main issues was whether the differences were genetic or environmental; the 
other was whether the tests were culturally and linguistically biased against 
the immigrants. Although the measurement of intelligence and environ- 
ment has improved somewhat since the 1920's, the terms of this discussion 
have changed not a bit. What happened was that immigration—and thus 
immigrant intelligence—ceased to be a relevant political concern. 

Other studies report roughly similar findings in comparisons of these 
groups during the 1920’s (Young, 1922; Brown, 1922; Pintner, 1923; 
Mead, 1927; Seago and Kolodin, 1925; Colvin and Allen, 1925; Pintner and 
Keller, 1922; Brigham, 1930). Although IQ is not the same thing as retardation, 
the two were not unrelated: research on New York City’s elementary 
schools in 1930 found that the correlation between school average 
retardation and school average IQ was .698 (Maller, 1933). 

Although much less research seems to have been carried out in secondary 
schools, the IQ differences persisted at the high school level. One 
study (Feingold, 1924) of the Hartford, Connecticut high school in the 
early 1920’s revealed the IQ differences displayed in Table 5. 
They are not as great as those for elementary schoolchildren, but this was 
probably due to the greater selectivity of secondary schools. 


TABLE 5 

NATIONALITY AND IQ, HARTFORD HIGH SCHOOL FRESHMEN 

Nationality            IQ      Nationality     IQ 
Scotch and English     105     French          98 
Native White           103     Irish           98 
Jewish                 103     Polish          97 
German                 103     Italian         97 
Scandinavian           102 

The Hartford research also illuminated the relation between ethnicity 
and high school selection. There appear to have been only small differences 
among ethnic groups in the likelihood of entering high school: the 
ethnic proportions within the freshman class correspond quite closely to 
the ethnic proportions in the entire city population. Jews were slightly 
over-represented in the freshman classes, and native whites slightly 
under-represented, but these differences were small. Staying power, however, was 
rather a different question. Table 6 presents the ratio between high school 
freshmen and juniors, for each ethnic group. Thus, the numbers in the last 
column can be read as probabilities, that is, as the number of chances in 
one hundred that a freshman from each of the groups had of reaching the 
junior year.* 


TABLE 6 

NATIONALITY AND HIGH SCHOOL RETENTION, 
HARTFORD, CONN. 

                                              Per Cent Juniors 
Nationality            Freshmen     Juniors   of Freshmen 
Native White           892          572       64 
Jewish                 518          416       80 
Irish                  278          88        34 
Italian                206          58        28 
Scandinavian           114          56        48 
Polish                 90           22        24 
German                 86           40        44 
English and Scotch     76           34        44 
Total                  2,260        886       38 


Over-all, the chances of students remaining in school until the junior 
year were slightly less than four in ten, but there was enormous variation 
by nationality. Polish and Italian students had about 2.5 chances in ten of 
remaining until the junior year, whereas native whites had over six chances 
in ten. The Irish were a bit below the average, and the Germans were 
slightly above it. Jews who entered the freshman class had eight chances 
in ten of staying till their junior year, better than twice the city average. 
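

The "chances in ten" reading of Table 6 is a simple rescaling of the table's last column, and the comparison with the city average is a ratio of two percentages. The Python sketch below shows the arithmetic using only the printed percentage column; the names `retention_pct`, `chances_in_ten` and `relative_to_city` are hypothetical and chosen for illustration.

```python
# Last column of Table 6: per cent of freshmen reaching the junior year.
retention_pct = {
    "Native White": 64,
    "Jewish": 80,
    "Irish": 34,
    "Italian": 28,
    "Scandinavian": 48,
    "Polish": 24,
    "German": 44,
    "English and Scotch": 44,
}
CITY_AVERAGE_PCT = 38  # the "Total" row of Table 6


def chances_in_ten(pct: float) -> float:
    """Convert chances in one hundred to chances in ten."""
    return pct / 10


def relative_to_city(pct: float) -> float:
    """Ratio of a group's retention rate to the city-wide rate."""
    return pct / CITY_AVERAGE_PCT


# List the groups from highest to lowest retention.
for group, pct in sorted(retention_pct.items(), key=lambda kv: -kv[1]):
    print(f"{group:20s} {chances_in_ten(pct):.1f} in ten, "
          f"{relative_to_city(pct):.2f} x city average")
```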

It would be nice to know how much these differences owed to variation 
among ethnic groups in intelligence or inherited social and economic status, 
but there are no data on the students’ social and economic status. It is 
possible in some cases to get a rough idea, however, of the differences that 
could not have been due to IQ. Table 5 showed that the mean IQ for 
English and Scottish children was 105, and 103 for native whites, Jews, 
and Germans. Yet Jews were twice as likely as Germans, Scots, and English 
pupils to be in the Junior class three years later, and half again as likely 
as native whites. Thus, group-to-group variations in IQ seem unrelated 
to group-to-group variation in school retention. Although one might 
argue that these first-generation Jewish families had a social and economic 
edge on the Germans, it seems doubtful; what is more, it would be slightly 
fantastic to assume that the Jews had such an advantage over native 
whites. The differences in staying power in this case seem much more 
likely to result from variations in culture and motivation than from intel- 
lectual or social and economic differences. 

Nonetheless, this goes only a small way toward assessing the relative 
impact of ethnicity and class on immigrants’ educational attainment. Un- 
happily, at the moment, there is no really satisfactory direct way of exploring 
this. My search of the literature found only three studies which considered 
both factors at once. The results of one of these lonely and limited
efforts (Young, 1922), concerning Italians and native whites, are displayed
in Table 7. Although the mean differences are in some cases narrowed by
taking father’s occupation into account, they are by no means eliminated.
The results suggest that ethnic differences in the distribution of
occupational status accounted for some, but by no means all, of the
variation among ethnic groups in educational attainment.


TABLE 7

NATIONALITY, SOCIAL CLASS, AND INTELLIGENCE
CALIFORNIA TWELVE-YEAR-OLDS
(ALPHA SCORES BY TAUSSIG OCCUPATION SCALE)

Father’s occupation         Native white            Italian
Professional                83.35 (N illegible)     [illegible]
Semi-prof. & business       67.30 (100)             40.70 (25)
Skilled workers             54.75 (120)             36.06 (32)
Semi-skilled                41.60 (51)              35.92 (37)
Common labor                48.60 (18)              19.57 (102)

Mean                        60.40 (316)             28.20 (196)
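
The claim that occupation accounts for some but not all of the gap can be checked with a direct standardization of the Table 7 figures. The sketch below is illustrative only and is not the method used by Young (1922). It assumes that the first column of Table 7 is native white and the second Italian, and it leaves out the professional row because its cells are illegible in the scan. The names `TABLE_7` and `weighted_mean` are introduced here for illustration.

```python
# Class-specific mean Alpha scores and group sizes (N) as read from Table 7.
# Assumption: the first column of the table is native white, the second Italian.
# The professional row is left out because its cells are illegible in the scan.
TABLE_7 = {
    "Semi-prof. & business": {"native_white": (67.30, 100), "italian": (40.70, 25)},
    "Skilled workers": {"native_white": (54.75, 120), "italian": (36.06, 32)},
    "Semi-skilled": {"native_white": (41.60, 51), "italian": (35.92, 37)},
    "Common labor": {"native_white": (48.60, 18), "italian": (19.57, 102)},
}


def weighted_mean(means, weights):
    """Mean of class-specific means, weighted by the given class sizes."""
    return sum(m * w for m, w in zip(means, weights)) / sum(weights)


native_means = [row["native_white"][0] for row in TABLE_7.values()]
native_sizes = [row["native_white"][1] for row in TABLE_7.values()]
italian_means = [row["italian"][0] for row in TABLE_7.values()]
italian_sizes = [row["italian"][1] for row in TABLE_7.values()]

# Crude means: each group weighted by its own occupational distribution.
crude_native = weighted_mean(native_means, native_sizes)
crude_italian = weighted_mean(italian_means, italian_sizes)

# Standardized mean: Italian class means weighted by the native white distribution.
standardized_italian = weighted_mean(italian_means, native_sizes)

crude_gap = crude_native - crude_italian
adjusted_gap = crude_native - standardized_italian

print(f"crude gap:    {crude_gap:.1f}")
print(f"adjusted gap: {adjusted_gap:.1f}")
```

Under these assumptions the gap over the four shared occupational classes falls from roughly 28 points to roughly 20 points once the Italian children are given the native white occupational distribution. That is a narrowing, not an elimination, which is the pattern the text describes.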

Another of the studies on the class-ethnicity issue is Arlitt’s (1921).
This study involved 343 primary grade children from a single unidentified
school district. The comparisons were between native white and Italian
children: the Taussig scale of occupations was employed, and the IQ’s were
on the Binet Scale. The results are displayed in Table 8.

TABLE 8
NATIONALITY, CLASS, AND IQ

                  Median IQ for            Median IQ for
                  Entire Ethnic Group      Lowest Two SES Groups
Native white      106.5                    92.0
Italian            85.0                    85.0


The first column gives the simple comparison between the two groups; the
second column takes father’s occupation differences into account to some 
extent. It displays the median IQ’s for those children whose fathers were 
semi-skilled and unskilled laborers; the comparisons could only be carried 
out for this group, since no Italian children had fathers with other occupa- 
tions (see also Bere, 1924). Controlling for father’s occupation very sharply 
reduced the pure ethnic differences in IQ. 

Although this reveals that simple ethnic comparisons were quite
misleading, it does not resolve the issue. It is as easy to believe that
further controls for class and urbanism would have eliminated the ethnic
differences as to believe that much of the variation was due to culture,
not class.

In summary, then, although the evidence I have presented is frag- 
mentary and often non-comparable, it suggests that in the first generation, 
at least, children from many immigrant groups did not have an easy time 
in school. Pupils from these groups were more likely to be retarded than 
their native white schoolmates, more likely to make low scores on IQ tests, 
and they seem to have been a good deal less likely to remain in high school. 
It also appears that children of first-generation immigrants from these 
groups had as difficult a time in the 1920’s or 1930’s as their predecessors 
experienced during the first decade of the century. 

It must be equally clear, however, that being the son or daughter of 
an immigrant did not in itself result in below-average educational attain- 
ment. Children whose parents emigrated from England, Scotland, Wales, 
Germany, and Scandinavia seem to have generally performed about as well 
in school as native whites; certainly their average performance never 
dropped much below that level. The children of Jewish immigrants typi- 
cally achieved at or above the average for native whites. It was central 
and southern European non-Jewish immigrants—and, to a lesser extent, 
the Irish—who experienced really serious difficulty in school. On any 
index of educational attainment (whether it was retardation, achievement 
scores, IQ, or retention), children from these nationalities were a good deal 
worse off than native urban whites. 


Perhaps the most interesting question this raises involves the origin 
of these ethnic differences: did they arise primarily from group differences
in inherited social and economic attributes or were they chiefly the con- 
sequence of differences in culture and motivation? At first glance, the 
second seems a likely alternative; after all, the main over-achievers—the 
Jews—typically placed a great value on education. But there is more to it 
than that, for there is evidence which suggests that the rank order of
intelligence among immigrant groups would correspond roughly to their rank
order on an index of urbanization. This is clearest if one compares the
Italians (most of whom emigrated from southern Italy) and the Poles 
with immigrants from Germany, or with the Jews. It is interesting to note, 
in this connection, that there were very great differences among the Jews 
according to nation of origin. The U. S. Immigration Commission (1911,
Vol. I, p. 31) found that 37% of German Jewish children experienced
school retardation, as against 41% for the Russian Jews, 52% of the Ru- 
manian Jews, and 67% of the Polish Jews. These proportions closely 
resemble those for non-Jews of those nationalities. In addition, there is 
some evidence that among the immigrant groups, those whose children 
achieved well stood somewhat higher on the occupational scale. Bere 
(1924), for example, presents the following distribution of occupational 
classes on the Taussig scale, for Italian and Jewish fathers in her New 
York City study. 


TABLE 9
CLASS, BY NATIONALITY

                         Per Cent in Each Class
Occupational Class       Italian         Jewish
Professional             [illegible]     [illegible]
Semi-Professional          6.75           13.04
Skilled                   36.48           34.78
Semi-Skilled              17.56           45.65
Unskilled                 39.19            6.52


Another important issue concerns the schools’ response to the
immigrants. The arrival of large numbers of immigrant pupils coincided with
the emergence of IQ and achievement testing, vocational guidance, and the 
movement to diversify instruction and curriculum in city schools. There 
is more than a little evidence that these practices were employed—if not 
conceived—as a way of providing the limited education schoolmen often 
thought suitable for children from the lower reaches of the social order. The 
tension this suggests also extended to the schools’ culture: there is no evi- 
dence of any effort to employ the immigrants’ language and culture as 
educational vehicles. I have been unable to find any hint that cultural
diversity was entertained as a serious possibility; it appears that the WASP 
culture reigned supreme in urban public schools. In this connection, it is 
important to note that there appears to have been a substantial movement 
to create educational alternatives among some immigrant groups. For 
the Irish and Italians, of course, the Catholic parochial schools served this 
function, as did part-time religious schools for the Jews. There also were 
efforts—among the Bohemians, for example—to establish part-time 
“language schools” as a way of maintaining and transmitting the culture. 

Finally, there is the question of schooling and social mobility. I have 
shown that there was a good deal of variability in immigrant children’s 
educational attainment: some groups did as well as or better than the
average for native urban whites, and others much worse. But to show that
the children of many immigrant groups had difficulty in school is not 
to show that education turned out to be a less effective way for them to 
climb the social and economic ladder. Almost all the results I have pre- 
sented are based on evidence about the children of first generation immi- 


This, however, is another part of the story, and like the other questions 
I have raised, it requires more attention than is possible here. 
Individuals and groups vary in measured intellectual performance at
different times, and here we attempt to order knowledge about this mutabil- 
ity, particularly as it may bear on the distribution of mental retardation. 
Our object was to find a secure basis for statements about the mutability of 
intelligence and to set those statements out as testable propositions. Such 
propositions can find application in practice if they are valid; they can 
be rejected and supplanted by others if they are invalid; and they can be 
elaborated and specified if they are too general.

There is no need here to enter into controversy on the heritability of 
intelligence (Conway, 1958; Burt, 1961; Jensen, 1969). Current genetic 
models emphasize that the relative contribution of heritable and environ- 
mental factors is by no means fixed and may change with changed en- 
vironmental conditions (Edwards, 1969). Identical twins show a high 
degree of concordance for many characteristics, but for the epidemiologist 
it is often less fruitful to examine the circumstances in which they resemble 
each other than the circumstances in which they are dissimilar. 

The heritable components of a characteristic that has a continuous 
distribution, like IQ, are assumed to result from multiple genes that are 
polymorphic; each polymorphic gene may express itself in different forms, 
each form in different degrees. The multiple forms of expression allow for 
subtle and complex interaction with the environment. Current genetics, in
common with epidemiology, uses models of multiple causality. Even where the
heritable component of a particular characteristic appears to be large, the 
models are compatible with dramatic changes in the frequency of the char- 
acteristic produced by the environment. Tuberculosis and rheumatic fever, 
diseases for which a genetic element has seemed established, have dwindled 
to a small fraction of the rates that existed a half-century ago. Similarly, 
the heritable component for height and weight has been estimated to be 
larger than the heritable component of IQ scores, yet this century has seen 
considerable increases in the height and weight of the populations of in- 
dustrial societies. 


*Support for the preparation of this manuscript was supplied in part by U.S.O.E.
contract #6-10-240 with Yeshiva University. 
An impressive number of studies established that the intelligence
quotient, once thought to be a constant throughout the life of an individual,
fluctuates substantially. For the clinician, the inconstancy of the IQ in an 
individual may be a hindrance; it reduces the predictability of future scores 
from known scores. For the epidemiologist, however, the inconstancy of 
the IQ in individuals and in populations opens up a new field of study; 
inconstancy that proves to be systematic and not random can yield insights 
into its causes and into the factors that give rise to high or low levels of IQ. 
At the lower end of the IQ scale, the existence of systematic variation in 
IQ may also point to the existence and the causes of systematic variation 
in mental retardation. 


In the studies included in the discussion that follows, IQ test scores 
are treated as a measure of selected aspects of intellectual performance at 
one point in the life course. We chose IQ as a dependent variable with 
deliberation and in awareness of its sketchiness as a measure of the range 
of individual potential. Compared to most measures employed by epi- 
demiologists, intelligence tests are highly standardized and reliable. Thus, 
from the epidemiological point of view the IQ provides an objective and 
available measure, and it can be used to estimate the capability of groups 
to meet the intellectual demands of the occupational and social roles of 
industrial societies. Few researchers will deny the concurrent validity of the 
IQ for estimating performance in school and similar settings. At the same 
time, most will admit deficiencies in the predictive accuracy of the IQ; 
it is our purpose to seek out systematic departures from predicted values. 


The studies reviewed were selected first for relevance and second for 
rigor; most of them have unavoidable and avoidable weaknesses in method. 
Caution is therefore called for in interpreting them. 


Heterogeneity of method characterizes the studies covered in this re- 
view. The investigators varied in their choices of design, of units of observa- 
tion and of variables of concern. Although variation in approach and 
methods is often a trial to one attempting a synthesis, it can be a strength 
because a consistent result obtained with inconsistent methods increases 
logical validity. Varied approaches and methods serve to define the limits 
to which the results can be generalized. The isolated study may lead to
interpretations which are at times more restrictive, but at other times more
general, than the perspective obtained from a number of studies would
allow. Juxtaposition clarifies the circumstances and the groups to which
the results may apply.


Below we set down propositions that either held up under our scrutiny, 
or that we believe will hold up when subjected to further testing. In the 
body of the paper that follows, each proposition is discussed in more detail. 
Propositions


1. IQ scores of certain social groups vary between age groups. 

1.1 The intellectual performance of schoolchildren from groups at a 
marked social disadvantage declines as they grow towards puberty. 

1.2 Children who have suffered from cultural retardation make IQ 
gains in early adulthood.* 

2. IQ scores of children who are either mentally retarded or at a marked 
social disadvantage change in directions predictable from certain social 
experiences or social stimuli. 

2.1 Declines of IQ with age among those at a social disadvantage can 
be attributed in part to inadequate schooling. 

2.2 In mentally retarded children, IQ change accompanies change in 
residential setting, and the direction of change depends on the type of 
setting. 

2.3 In mentally retarded children and in children at a marked social 
disadvantage, IQ gains follow specially designed educational, social and 
medical interventions. 

3. The duration of favorable effects of social experience on children’s 
IQ depends on the type of experience. 

3.1 The advantage in IQ that groups of preschool children in short- 
term pedagogic programs achieve over comparison groups is not maintained 
at the peak level. 

3.2 Enduring gains in IQ follow physical removal of children from one
social milieu to another.

4. Intervention effective in producing IQ change has not been limited 
to a well-defined critical period. 
5. The IQ level of a majority group limits the extent of IQ gains that can
be induced in subgroups which share the educational and social experiences
of the majority.
6. The more unfavorable are the social and educational origins of a group,
the greater is the potential that exists for change in IQ.
7. Among groups at a marked social disadvantage, changes in mean
IQ with age correspond with changes in the frequency of mild mental
retardation.
8. Among populations in defined geographic areas, improved environment
has produced a rise in IQ levels and a decline in the frequency of mental
retardation.

IQ Scores of Certain Social Groups Vary Between Age Groups. 
— 
*Cultural retardation is a term used here to describe a condition of mild mental re- 
tardation which occurs in individuals who have no detectable brain lesions and who 
come from groups at a marked social disadvantage. 
Intellectual Performance of Schoolchildren From Groups at a Marked
Social Disadvantage Declines as They Grow Toward Puberty. 

Wheeler (1942) studied Appalachian schoolchildren in Tennessee and
drew attention to a gradient in IQ declining from younger to older age 
groups (Fig. 1). The gradient appeared in 1930 and 1940. Such a fall-off 
in the IQ of disadvantaged children in the older age groups at school had 
been noted earlier by Gordon (1923), who observed that the children of 
canal-boat workers in the South of England had poor school records of 
attendance and performance, although they did not seem badly cared for 
at home. Gordon predicted that the ill effects of poor attendance on their 
intellectual performance would be cumulative as they progressed through 
the school grades. Figure 2 shows that among the somewhat small numbers 
studied there was indeed a lower level of IQ among the older age-groups. 
It is not reported whether or not the tester was blind to the research 
hypothesis. The groups of children in younger and older age groups seem 
to have been comparable in attributes and social experience, and not subject 
to obvious bias in selection, since they were assembled from sibships. 


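Fig. 1. Median [remainder of caption illegible; per the text, the IQ gradient
by age reported by Wheeler (1942) for 1930 and 1940]. [Figure not reproduced.
Medians legible in the accompanying data rows, from youngest to oldest age
group: 1930: 94.7, 90.9, 88.9, 86.4, 84.3, 80.0, 81.4, 77.6, later values
illegible; 1940: 102.6, 99.9, 99.2, 96.4, 91.4, 93.9, 90.2, 87.8, 85.1,
81.3, 80.0.]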

Gordon’s results, like Wheeler’s, were cross-sectional; they were
obtained at one point in time for each individual and for the whole group.
While recognizing problems of research design, one may reasonably assume
that Gordon would have found a decline in scores in a longitudinal follow
up of children. Gordon studied a second group of children, gypsies who
were less well-cared for but more regular school-goers than canal-boat
children. Because of their somewhat better attendance, Gordon predicted
that lower levels of intellectual performance would again be found among
older age groups, but that the depression would be less marked. The results
fitted the prediction.

Fig. 2. Mean Mental Ratios, by Age, of all the Schoolgoing Sibs from Canal Boat and
Gypsy Families (Sexes Combined). Constructed from Gordon (1923).

The trend towards lower levels of intellectual performance in older 
age groups, among children at marked social disadvantage, holds for 
developmental scores and school performance as well as for IQ (Osborne, 
1960; John, 1963). In a study by Knobloch and Pasamanick (1961), of a 
Baltimore cohort selected at birth and followed through five years of age, 
black children were at a clear social disadvantage compared with white 
children. The investigators found a gap between the developmental quo- 
tients of black and white children that widened with age. In infancy, 1.5% 
of the cohort had DQ’s below 80 and there was no significant difference 
between black children and white children. At three years of age, 
DQ’s below 80 were found in 7.1% of black children and 1.4% of white. 

Although the ethnic factor could not be completely controlled, the 
result is consistent with other effects of a harsh and poor social environment. 


The more recent results obtained by Kennedy, Van de Riet and White 
(1963) were similar to those of Gordon and Wheeler. In a cross-sectional
survey of 1800 Negro elementary-school children in the southeastern Unit- 
ed States, the IQ’s for older age groups were lower than for younger age 
groups (Table 1). These results can not be cited without reservation in 
confirmation of proposition 1.1. When the mean IQ’s for each grade were 
compared, there was little change as between higher and lower grades 
(Table 2). Schaeffer (1965) suggested that an artifact of the sampling 
method could have caused discrepancy between age-specific and grade- 
specific mean IQ trends. Children who enter elementary school at five 
and six years rather than later are young for their grades, and usually 
brighter than their age-peers; children of 12 and 13 years who remain 
in elementary school rather than pass into secondary school are old for 
their grades, and usually duller than their age-peers. The data are com- 
patible with the pattern expected from such selection: mean IQ’s were 
higher for five and six year old children in the elementary schools and 
lower for 12 and 13 year old children, but did not vary between the 
intermediate grades. Nonetheless, scores on the California Achievement 
Tests did not show these patterns; these scores fell increasingly below 
the national average in the higher age groups and in the higher grades. 
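
The selection artifact suggested by Schaeffer can be made concrete with a small deterministic sketch. It is not a reconstruction of the Kennedy data: the IQ values, the enrollment rules and the cutoffs in `in_elementary` are all hypothetical. Every birth cohort is given the same scores, so any difference between age groups in the output comes from who is enrolled in elementary school at that age, not from change with age.

```python
from statistics import mean

# Hypothetical IQ scores, identical in every birth cohort: no true change with age.
COHORT_IQS = [70, 75, 80, 85, 90, 95, 100]


def in_elementary(age: int, iq: int) -> bool:
    """Assumed enrollment rules, for illustration only."""
    if age == 5:
        return iq >= 90      # only the brighter children enter school early
    if age == 13:
        return iq <= 80      # only children held back are still enrolled
    return 6 <= age <= 12    # everyone is enrolled at the intermediate ages


def mean_iq_by_age(ages):
    """Mean IQ of the children enrolled in elementary school at each age."""
    return {
        age: mean(iq for iq in COHORT_IQS if in_elementary(age, iq))
        for age in ages
    }


means = mean_iq_by_age(range(5, 14))
for age, value in means.items():
    print(f"age {age:2d}: mean IQ {value}")
```

In this sketch the enrolled five-year-olds average above, and the enrolled thirteen-year-olds below, the constant mean of the intermediate ages, which is the shape of the age-specific pattern that selection alone would be expected to produce.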


TABLE 1

Stanford-Binet IQ in 1800 black schoolchildren in the
Southeastern United States, by Chronological Age.

Age             IQ Mean         SD
5 years         86.00           6.40
6 years         84.43          12.48
7 years         81.71          11.80
8 years         80.88          12.04
9 years         80.10          12.08
10 years        80.10          12.68
11 years        80.63          11.98
12 years        75.48          11.34
13 years        65.23          10.45
14 years        66.11           7.35
15 years        [illegible]    [illegible]
16 years        51.00          [illegible]
All ages        80.71          12.48

(Number column illegible in source.)

Adapted from: Kennedy, Van De Riet, and White (1963)


In a subsequent longitudinal study, Kennedy (1969) retested a
subsample of 132 children after four and one-half years. [words illegible]
89% of 360 children of the original sample, all those from one area. At
follow up, the mean IQ scores were not lower than the initial mean for
the 312 of the subsample retested, nor than the mean for the whole
sample of 1800. A limitation on analysis is that the performance of the
individual children and of subgroups cannot be compared at different
ages, because the additional data given are correlations between the mean
scores of the children at entry and at follow up only by school grade. A
bias in the result is that the losses from the sample included those
children from the poorest families and with the lowest test scores. Further
analysis of these data is needed, therefore, to interpret their meaning in
terms of IQ change with age. The results of the California Achievement
Test scores, in Kennedy’s longitudinal study, like those of his cross-sectional
study, showed a fall off in performance at follow up.


TABLE 2

Stanford-Binet IQ in 1800 black schoolchildren in the
Southeastern United States, by schoolgrade.

Grade                         Number      Mean       SD
First
  Male                         160       81.28     11.76
  [label illegible]            140       82.43     13.12
  [label illegible]            300       81.81     12.43
[grade label illegible]
  [label illegible]            160       81.50     12.51
  [label illegible]            140       79.10     12.99
  [label illegible]            300       80.38     12.79
[grade label illegible]
  [label illegible]            144       81.00     11.16
  [label illegible]            156       79.57     12.51
  [label illegible]            300       80.25     11.90
[grade label illegible]
  Male                         151       80.25     12.00
[remaining rows missing in source]

Adapted from: Kennedy, Van De Riet, and White (1963)

The apparent conflict between the results of IQ tests in cross-sectional
studies and those of Kennedy’s longitudinal study could reside in several factors.
The cross-sectional comparisons between age groups are subject, first, to 
the problems of comparability between groups not assigned at random, 
for example the social selection among samples of school children who 
enter and leave school at ages that differ according to social background. 
Second, they are subject to problems of change over time, for example 
the variable historical experience of successive cohorts, including expe- 
rience with tests. True age differences in intellectual performance at the 
same point in time can arise from the effects either of maturation with 
age or different durations of exposure to a particular environment. Ap- 
parent age differences in performance can arise from the different com- 
position and different environmental experiences of successive cohorts. 
The short time-span between the successive cohorts of elementary school 
children included in cross-sectional surveys, however, is unlikely to pro- 
duce differences between cohorts sufficient to account for the differences 
between age groups. Longitudinal comparisons, too, are subject to social 
selection that affects sampling of age groups and the loss from samples, 
to such test effects as sensitization and practice that could have improved 
the scores of the follow up samples, and to the statistical effects of re- 
gression to the mean. 

In sum, a number of studies, although not all, show that in the 
poorest social circumstances a decline in IQ occurs in older age groups, 
and all show a decline in educational attainment. These differences be- 
tween age groups are more likely to reflect a decline in performance with 
age than a rise in performance among the later and younger cohorts. 
Such a decline in performance with age can be taken to reflect a cumu- 
lative continuing effect of environment on development. 


Children Who Have Suffered From Cultural Retardation Make
IQ Gains in Early Adulthood


The downward trend in IQ among children at a marked social dis- 
advantage probably comes to a halt in the period after puberty. The data 
available for the analysis of IQ trends after puberty among socially
handicapped people come from studies of mentally retarded subjects.

Clarke and Clarke (1957) observed, over several years, a cohort of
moderately retarded persons admitted consecutively in an institution for
mental deficiency outside London. The subjects came to the institution
from families and from a variety of other institutions. They made
considerable IQ gains (Table 3). At entry to the study, the subjects’ average
age was 22 years, and they continued to make gains into their late
twenties. There were no losses from the initial sample at follow up.


The IQ gains found in these retarded young adults were larger than 
would be expected from a statistical regression effect and a practice effect 
combined. The limits of the size of such effects are set by the reliability 
of the IQ tests. A control experiment, in which a sample of retarded 
subjects was retested a short time after an initial test, yielded an increment 
of only four points, compared with 15 points in the main study. The 
IQ gains of the individuals in the main study can therefore be attributed 
either to intellectual maturation that continued into adulthood (having 
earlier been retarded by their childhood environment) or to environ- 
mental stimuli acting during the years of observation. 
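
The statement that test reliability sets a limit on the regression effect can be illustrated with the classical expectation for a retest score, in which a group scoring below the population mean is expected to move toward it by the fraction (1 - r) of its distance from that mean, r being the test's reliability. The sketch below is an illustration, not the Clarkes' calculation; the initial mean of 60, the population mean of 100 and the reliability of 0.9 are hypothetical values chosen only to show the order of magnitude.

```python
def expected_regression_gain(initial_mean: float,
                             population_mean: float,
                             reliability: float) -> float:
    """Expected retest gain from regression alone: (1 - r) * (mu - x)."""
    return (1.0 - reliability) * (population_mean - initial_mean)


# Hypothetical inputs, not taken from the study.
gain = expected_regression_gain(initial_mean=60.0,
                                population_mean=100.0,
                                reliability=0.9)
print(f"expected gain from regression alone: {gain:.1f} points")
```

With a highly reliable test the expected gain from regression is a few points at most, and it shrinks further if the relevant population mean is taken to be lower than 100. An observed gain several times larger therefore calls for another explanation.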


TABLE 3

IQ Change In Young Adult Mentally Subnormal Persons In An English Mental
Deficiency Institution, By Home Environment: Mean Scores (Wechsler, Form 1)
Over 6 Years

                                       1949                  [year illegible]       1955                  Difference 1949/1955
Group from most adverse homes: N = 9
  Mean score                           5?.6 ± 9.7 (20.4)*    70.7 ± ? (?)           75.8 ± 10.3 (26.4)    16.2+
Group from less adverse homes: N = 19
  Mean score                           62.3 ± 13.4 (21.9)    6?.? ± 13.? (24.9)     72.5 ± 13.3 (?7.9)    [illegible]

Total: N = 28

*Mean chronological age in parentheses.
+P < .001 (one-tailed t test).
? marks a digit or value illegible in the source.
Adapted from: Clarke, Clarke, and Reiman (1958).


The contribution to these IQ gains of environmental stimuli acting 
during the period of observation was rendered doubtful by a subsequent 
experimental study. Clarke, Clarke, and Reiman (1958) placed a
randomized group of the young adult mentally subnormal patients from 
the mental deficiency institution in a special rehabilitation unit. There 
they were given at least six months of vocational training and six months 
of work in the community. After a year in the unit, the specially trained 
subjects had not made significant gains over their controls who had re- 
mained in the regular institution setting. 

Another cohort study of mildly subnormal persons in Lancashire, 
England provided data on IQ change in adulthood (Stein and Susser, 
1969). Initial observations made when the cohort’s average age was 12
were compared with endpoint observations made at ages 20 to 24 years. 
Two classes of subjects were distinguished in the sample. One category 
was socially and clinically homogeneous: all subjects came from the lower 
social strata; none had detectable signs of brain damage; and they tended 
to make IQ gains. A second category was socially and clinically
heterogeneous: the subjects came from several social strata; all had detectable
signs of brain damage; and they tended not to make IQ gains (Table 4). 

IQ scores at school were based on Stanford-Binet tests; in adulthood 
they were based on the WAIS. Part of the ‘gains’ in IQ score might be 
attributed to a differential response to the two tests, but not to lapse of 
time. It is highly unlikely, however, that all the gain could be accounted 
for in this way. With regard to statistical regression to the mean, the
direction of the changes in the two classes of retarded subjects was
contrary to that expected. The class that made gains in IQ was least subject
to regression because they had the highest initial IQ scores and because 
they were drawn from social groups with the lowest mean scores. In 
neither group did IQ scores relate to experience of any special educational 
program, although some had attended a special school for retarded pupils. 

In this study, the IQ gains made by the socially and clinically homo- 
geneous group could have occurred at any time between the initial tests 
at pubescence and the subsequent tests in their early twenties. A further 
study of similar retarded persons in another Lancashire special school, 
however, indicated that no systematic gains had taken place before the 
school-leaving age of 16 years (Stein and Stores, 1965). Unfortunately, 
the subjects’ experience after leaving school is not known. 

There was strong evidence that the mild retardation in childhood of 
the group without overt brain damage could be attributed to their en- 
vironment, especially to their cultural environment (Stein and Susser, 
1960). The evidence rested on the contrast of the “no overt brain damage” 
group with the “detectable brain damage” group with respect to homo- 
geneous social characteristics, their segregation by social class, their history 
of social mobility in three generations, and their subsequent improvement 
in IQ and social performance. We concluded that their IQ gains repre- 
sented recovery from retardation of mental development culturally induced 
during childhood. Their recovery can therefore be seen as delayed ma- 
turation, an explanation supported by the results and the interpretation 
of Clarke and Clarke. We noted above that the subjects in the experi- 
ments of Clarke and Clarke made long-continued increments that were 
not affected by an experimental training program. Such increments can 
not readily be attributed to the social transition of admission to the mental 
deficiency institution (see proposition 2.2) nor to its regular training
program. This does not mean that an appropriate social or educational 
milieu could not accelerate the maturation process, but the effects of such 
efforts remain to be demonstrated by experiment. 

IQ Scores of Children Who Are Either Mentally Retarded Or
At A Marked Social Disadvantage Change In Directions
Predictable From Certain Social Experiences or Social Stimuli


STEIN & SUSSER MUTABILITY OF INTELLIGENCE AND EPIDEMIOLOGY OF RETARDATION 


TABLE 4

IQ Changes by Clinical States in a Follow-up, at Age 20 to 24 Years,
of a Sample of 50 Subjects Classified Educationally Subnormal at School:
WAIS Score at Follow-up Compared with Stanford-Binet Score at School

[Table body illegible in source. Rows: No brain disorder; Brain disorder.]

Four cases excluded [remainder of note illegible].
Unchanged means ± 3 points of the original score, which is the standard error of the mean for the group.
Testing the difference in increments between ‘no brain disorder’ and ‘brain disorder,’ the t test gives [values illegible].
Adapted from Stein and Susser (1960).


Declines of IQ With Age Among Those At A Social Disadvantage Can be 
Partially Attributed to Inadequate Schooling 

Gordon’s hypothesis that a declining gradient in IQ with age would 
result from insufficient schooling was given support by a Swedish study 


military induction, those who had gone beyond primary school made 
substantial gains in IQ when compared with those who had not (Table 
5). This result held through each level of childhood IQ. The result 
could not be explained solely by the selection of the higher social classes 
for secondary school attendance. Nor could the result be attributed to 
regression to the mean. Regression would not differentiate within the 
same social classes those who had gone beyond primary school from those 
who had not, so that the gains of low scorers were greater, and the losses 
of high scorers were less. 
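
The comparison within each level of childhood IQ can be read off Table 5 directly. The sketch below uses only those IQ bands whose two mean scores are clearly legible in the scanned table, so the figures should be treated as a reading of a damaged source and not as a re-analysis of Husén's data. The names `TABLE_5` and `advantage` are introduced here for illustration.

```python
# Mean IQ at induction (1948) by IQ band at school (1938), as read from Table 5.
# Only bands whose means are clearly legible in the scan are included.
# Tuple order: (primary school only, schooling beyond primary).
TABLE_5 = {
    "75-": (86, 101),
    "85-": (92, 101),
    "95-": (98, 104),
    "115-": (108, 118),
}

# Difference in mean induction IQ between the two schooling groups, per band.
advantage = {
    band: beyond - primary
    for band, (primary, beyond) in TABLE_5.items()
}

for band, points in advantage.items():
    print(f"IQ band {band}: {points} points higher with further schooling")
```

In every band included here the group with further schooling has the higher mean at induction, which is the sense in which the result holds through each level of childhood IQ.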

TABLE 5

Mean IQ At Army Induction, For IQ Groups Determined By Testing At School
10 Years Previous, By Extent Of Schooling: Stockholm 1948.

                                      IQ 1948*
                   Primary School Only    [Secondary] Schooling & Above    All
IQ 1938     N      N      Mean Score      N      Mean Score                Mean Score
65-         29     28     [illegible]     1      105                       [illegible]
75-         79     73     86              6      101                       87
85-         123    113    92              10     101                       93
95-         125    94     98              31     104                       99
105-        145    79     101             66     115                       107
115-        92     39     108             53     118                       114
[totals row illegible]

*Distribution of scores and standard deviations not given.
Adapted from: Husén (1951).

If more prolonged and extensive education leads to improved IQ
scores, one might hope that equal schooling would cancel the disparity in
the school performance of children from different social classes. An English
study (Central Advisory Council, 1954) does not encourage this view.
The progress through school of children selected at age 11 for the academic 
stream (the grammar schools) was followed and related to father’s occupa- 
tional class. Among children from the lower social classes in the brightest 
third at entry, a marked fall in rank on school performance had occurred 
by the time they reached school-leaving age. Among children from the 
higher social classes in the dullest third at entry, a marked rise in rank 
had occurred (Figure 3). The working class children in this study were 
selected for and attended the same superior schools as the higher class 
children, but they showed a deterioration in school performance. This 
deterioration was apparent even after early school-leavers had been 
weeded out through social and self-selection. A possible bias entered into 
these results because the classification into three groups at 11 years of age 
was made on rank within schools, and the classification at 16-18 years 
was made across the whole school population. The ranking of lower-class 
children could show to better advantage in schools where there was a 
higher proportion of lower-class children (as at age 11) than in the 
sample as a whole (as at ages 16-18). 


[Figure 3: paired bar charts. Vertical axis: percent. Horizontal axis:
social class, I to V. Left panel: high scorers (rated in top third at 11
years). Right panel: low scorers (rated in bottom third at 11 years).]

Fig. 3. Percent in Top and Bottom Thirds of Academic Achievement at 16-18
Years among 6093 English “Grammar” School Children, within Each of Five
Social Classes Determined From Father’s Occupation, for Top and for Bottom
Thirds of Academic Achievement at Eleven Years of Age. (Plain Bar: Top Third
at 16-18 Years; Single-Hatched Bar: Bottom Third at 16-18 Years.) Adapted
from Central Advisory Council for Education (1954).


The trends are not consistent with a regression phenomenon. The mean 
school performance levels of the higher social classes are not likely to
be above the mean for grammar school entrants, who represented the top
20 to 25% at 11 years of age. Yet whatever the initial rank of the children, 
on reassessment the higher social classes tended either to remain in or 
to move into the highest performance levels; the children of the lower 
social classes tended to remain in or to move into the lowest performance 
levels (Figure 3, Table 6). Such intellectual divergence among children 
starting from the same base line must surely be socially induced. Schools 
adequate for the needs of higher-class children may not be adequate for 
the needs of working-class children. 


TABLE 6
Percent Distribution of Superior, Average and Inferior Academic Ratings of English “Grammar”
School Children at 16-18 Years, by Academic Rating on Entry at 11 Years,
within Five Social Classes Determined from Father's Occupation

                                  (I)          (II)       (III)      (IV)       (V)
                              Professional
                                  and                                Semi-
At 11 Years   At 16-18 Years  Managerial     Clerical    Skilled    Skilled   Unskilled

Superior      Superior            --            --          --         --        --
              Average             --            --          --         --        --
              Inferior            9.9          18.6        26.4       37.9      54.0
              All               100           100         100        100       100

Average       Superior           61.8          53.3        42.6        --       25.6
              Average            12.8          15.2        16.5       14.6      12.0
              Inferior           25.4          31.5        40.9       51.1      62.4
              All               100           100         100        100       100

Inferior      Superior           48.3          36.3        32.6       22.8      12.8
              Average            18.2          21.5        15.6       15.1      10.7
              Inferior           33.5          42.2        51.8       62.1      76.4
              All               100           100         100        100       100

-- Entry not legible in the source scan. The group sizes (N) of the five
social classes are likewise not legible.
* Father's occupation was unclassified in 651 cases.
Adapted from: Ministry of Education: Central Advisory Council for Education (England).

TABLE 7
Percent Distribution of Scholastic Grades in Baccalaureate Degrees of the
1958 Cohort of Manchester University Students, By Father's Occupation (Sexes Combined)

[Column headings and group sizes for the five father's-occupation groups
are not legible in the source scan. Legible fragments refer to manual and
semi-skilled occupations, fathers not occupied, and cases with no
information. The final column is All (N = 1502).]

Degree status             (1)    (2)    (3)    (4)    (5)    All

Superior pass              26     23     24     10     20     22
Above average pass         29     33     38     24     26     30
Ordinary pass              30     27     28     35     31     26
No degree obtained         16     17     10     31     24     16
Total                     100    100    100    100    100    100

Adapted from: Brockington and Stein (1963).

The social class divergence in scholarly performance seems not to
persist through all stages of education. Social class differences in
performance did not appear in a cohort of English university students five
years younger than those in the grammar school study nor in studies by
Brockington and Stein (1963; Table 7). Continued social selection for the
academic culture and the socialization that accompanies selection
presumably account
for the equal university performance among the social classes. The absence
of a regression effect at university level supports the argument that regres- 
sion does not account for the results found at the high school level. 

For present purposes we have regarded the IQ as “given”. The disparity 
between social classes in average IQ scores that develops during the school 
years could be due to a lack of validity of the IQ measure rather than to 
disparity in the developing capabilities of the children measured. It might 
be that the content of the tests devised for older age groups becomes 
increasingly saturated with experiences and concepts of the culture of the 
higher social classes. It might also be that as children in the lowest 
strata become older, their culture becomes increasingly differentiated
from that of the majority of children, whether or not the cultural balance
of the content of the tests remains the same. This culture might induce a
trained incapacity for handling IQ tests. 


Whether or not these reservations about the validity of IQ tests are 
accepted, they do not detract from the social significance of a decline in 
performance on IQ tests at older school ages. Since the scores reflect 
scholarly capability within the school setting, children of the lower social 
classes judged by this capability are at a progressive disadvantage as they 
pass through school and the social transitions of adolescence. Some of 
these children are presumably pushed or held below the threshold accepted 
as normal in the school setting, and thereby they are classed among the 
mildly retarded. 

In Mentally Retarded Children, IQ Change Accompanies Change in
Residential Setting, and Direction of Change Depends on Type of Setting

Skeels and Skodak (1945) carried out one of the earliest longitudinal 
studies of the intellectual performance of retarded children. Their study 
is one of the best documented and the one that has been longest followed 
(Skeels, 1966). Infants and young children were removed from one insti- 
tution, an orphanage described as “affectionless”, to another that pro- 
vided more intimate personal contact with adults. They were tested and 
retested during a three year period, and then followed up 25 years later. 
This study has the advantages of a prospective, quasi-experimental design, 
although the controls were not randomly assigned (indeed, they were not 
assigned at the same time as the cases) and numbers were small. In its 
first stages, the study faced powerful criticism because of its imperfections
of design (McNemar, 1940), but the long-term follow-up has yielded a
cogent result.

The IQ’s of the experimental and contrast groups, as far as testing
of young children can be taken as reliable, had been [remainder of
sentence not legible in the source scan].
Intervention involved removal from the impersonal care of an orphanage
staff and consignment to the personal care of mentally retarded women
in another institution. The children in the care of the mentally retarded
women soon reached normal levels of development for their ages (Table
8). All were adopted but one. Thereafter, their life-tracks were indis- 
tinguishable from those of children raised by their own families. The 
contrast group who remained in the orphanage mostly remained at a low 
development level and followed the semi-dependent life-track in residential 
institutions characteristic of many retarded persons. In this result, although 
the orphanage undoubtedly was adverse to development, the effects of 
removal from one influence and exposure to another are inextricably
related. Still a third element, the transitional event itself, may have been
decisive and critical. This effect too is inseparable from the subsequent 
continuing experience of family affection and support. 


TABLE 8


IQ Scores* of an Experimental Group Removed from an Orphanage to the
Care of Women in an Institution for the Mentally Retarded, Compared
with Scores of a Contrast Group Remaining in the Orphanage.

                                      Duration of     Mean
                      Age at start    experiment      scores at       Scores at
                      in months       in months       start           end

Experimental
group (N = 13)        18.3 ± 6.6      18.9 ± 11.6     64.3 ± 16.4     91.8 ± 11.5

Contrast
group (N = 12)        16.6 ± 3.2      30.7 ± 5.8      86.7 ± 13.9     60.5 ± 9.7

* Kuhlmann-Binet (1922).
Adapted from: Skeels, H. M. (1966).


Other studies have further emphasized the importance of the content
and structure of personal relations to intellectual performance and
to the mode of children’s development (Hunt, 1961; Bowlby, 1952; Hess
and Shipman, 1965; Bernstein, 1961). Some of these studies indicate that
certain institutions are worse than others for a variety of indices of
development (Lyle, 1960; Tizard, 1964), that institutions are worse than
families (Steadman and Eichorn, 1964; Stein and Susser, 1966), and that
some family settings are worse than others (Stein and Susser, 1967; Gross,
1967).

It appears that the more intimate and exclusive the child’s relation
with adults is, the more favorable the setting is to his development. But
the evidence can not be held to point solely to the role of “affection.”
Other factors such as interaction vary in an equally consistent fashion.
Tizard’s experimental study, in which matched pairs of children were
randomly assigned as cases and controls, represents well the complexity
of the treatment input that is to be expected with environmental change.
He transferred 16 “severely subnormal” children from a large conventional 
institution for mental deficiency to a small institution (Brooklands) run 
on family-group lines. The new environment provided more and warmer 
personal contacts: a staff member was assigned to each child; peer group 
relations were built on a family model; and there was generally more 
individuation. Activity was more varied, stimulating, and independent 
than in the large institution. The experimental group, compared with 
the control group, made gains in certain developmental and verbal IQ 
scores but not on performance scores (Table 9). 


TABLE 9


Verbal Mental Age in Mentally Subnormal Children Removed From a
London Mental Deficiency Hospital to a Residential Home, Compared with
Matched Controls at Outset and After 18 Months

                      Initial testing**                Final testing+
                 Mean verbal    Mental age        Mean verbal    Mental age
                 scores*        equivalent        scores*        equivalent

Brooklands
(N = 16)         63.06          22 months         83.38          32 months

Controls
(N = 16)         63.84          22 months         72.69          26 months

* Minnesota Preschool Scale. Mean Verbal Scores based on “C Scores,” units of
equal interval, some of which were extrapolated below the lower limit of the test.
** No significant difference between Brooklands and controls.
+ P < .005 between Brooklands and controls.
Adapted from: Lyle (1960).
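
The comparison in Table 9 amounts to a difference between two gains. As a worked check that uses only the means printed in the table, the sketch below computes the gain of each group and the difference between the gains. The data structure and the function names `gains` and `gain_difference` are illustrative and do not come from the original study.

```python
# Mean verbal scores (Minnesota Preschool Scale "C scores") and mental-age
# equivalents in months, copied from Table 9.
TABLE_9 = {
    "Brooklands": {"initial": 63.06, "final": 83.38, "ma_initial": 22, "ma_final": 32},
    "Controls": {"initial": 63.84, "final": 72.69, "ma_initial": 22, "ma_final": 26},
}


def gains(table):
    """Return per-group gains in verbal score and in mental age (months)."""
    return {
        group: {
            "score_gain": round(row["final"] - row["initial"], 2),
            "ma_gain": row["ma_final"] - row["ma_initial"],
        }
        for group, row in table.items()
    }


def gain_difference(table, treated="Brooklands", control="Controls"):
    """Gain of the treated group minus gain of the control group."""
    g = gains(table)
    return {
        "score": round(g[treated]["score_gain"] - g[control]["score_gain"], 2),
        "ma": g[treated]["ma_gain"] - g[control]["ma_gain"],
    }


if __name__ == "__main__":
    print(gains(TABLE_9))
    print(gain_difference(TABLE_9))
```

On these figures the Brooklands group gained 20.32 score points (10 months of verbal mental age) against 8.85 points (4 months) for the controls, a difference in gains of 11.47 points (6 months).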


The negative effect of the admission of children to conventional in- 
stitutions can be inferred from the studies of Skeels and of Tizard. Such 
a negative effect was more directly demonstrated by the observational 
cohort study of Sternlicht and Siegel (1968). In this study subjects were 
stratified by age, sex, and degree of retardation and followed through 
two years after admission. Among children aged 5 to 11 years, IQ declined 
despite emphasis on educational programs. Among older children and 
adults, IQ changed little (Table 10). In the study of Steadman and 
Eichorn, 10 children with Down’s syndrome, who were admitted to an 
institution similar to Brooklands before the age of four months, did not 
do as well on development scores as a matched contrast group of 10
children with the syndrome who had remained at home. In Kirk’s (1958)
experimental study of children in an institution, children in the contrast
group showed a decline in IQ during the study period, although a special
educational program produced a rise in IQ among the 15 children in the
experimental group (Table 11).


TABLE 10
Mean IQ On Admission To A New York State Mental Deficiency Institution
And Two Years Later, In Age And Sex Groups, By IQ Level

                           Subjects (N)   Initial IQ   Terminal IQ    p+
Age 5-11*
   Boys     low IQ             (8)           39.5         29.4        .01
            high IQ            (8)           62.2         52.9        .05
   Girls    low IQ             (8)           41.7         31.8        .05
            high IQ            (8)           56.5         47.4        .05
Age 12-18**
   Boys     low IQ             (8)           40.6         37.5        NS
            high IQ            (8)           62.5         60.1        NS
   Girls    low IQ             (8)           41.1         36.8        NS
            high IQ            (8)           57.3         55.7        NS
Age 20-50***
   Men      low IQ             (8)           34.5         34.2        NS
            high IQ            (4)           53.5         51.5        NS
   Women    low IQ             (8)           34.4         33.1        NS
            high IQ            (8)           58.1         58.8        NS

* Stanford-Binet, 1960 Revision.
** 1960 Binet, WISC or WAIS.
*** Binet or WAIS.
+ t-test, one-tailed (Distribution of scores and S. D. not given).
Adapted from: Sternlicht and Siegel (1968).


TABLE 11
Mean IQ's of Experimental and Contrast Groups of Mentally Subnormal Children
I. Treated in the Community
II. Treated in an Institution

                                  I Community                 II Institution
                             Experimental   Contrast      Experimental   Contrast

Initial IQ                   72.5 (4-5)*    75.8 (4-4)    61.0 (4-4)     57.1 (4-8)
Change during experiment       +11.2          -0.6          +12.0          -7.2
Change at follow-up**           +0.5          +7.5           -1.8          +0.7
Total change                   +11.7          +6.9          +10.2          -6.5

* Mean chronological age in parentheses.
** After one year of regular school.
Adapted from: Kirk (1958).
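
The age pattern reported by Sternlicht and Siegel can be summarized directly from the means in Table 10. The sketch below computes an N-weighted mean change from initial to terminal IQ for each age band. The data structure and the function name `weighted_mean_change` are illustrative, and because only subgroup means are published, the result is a descriptive summary and not a significance test.

```python
# (age band, subgroup, N, initial IQ, terminal IQ), copied from Table 10.
TABLE_10 = [
    ("5-11", "boys, low IQ", 8, 39.5, 29.4),
    ("5-11", "boys, high IQ", 8, 62.2, 52.9),
    ("5-11", "girls, low IQ", 8, 41.7, 31.8),
    ("5-11", "girls, high IQ", 8, 56.5, 47.4),
    ("12-18", "boys, low IQ", 8, 40.6, 37.5),
    ("12-18", "boys, high IQ", 8, 62.5, 60.1),
    ("12-18", "girls, low IQ", 8, 41.1, 36.8),
    ("12-18", "girls, high IQ", 8, 57.3, 55.7),
    ("20-50", "men, low IQ", 8, 34.5, 34.2),
    ("20-50", "men, high IQ", 4, 53.5, 51.5),
    ("20-50", "women, low IQ", 8, 34.4, 33.1),
    ("20-50", "women, high IQ", 8, 58.1, 58.8),
]


def weighted_mean_change(rows):
    """Return the N-weighted mean IQ change (terminal - initial) per age band."""
    totals = {}
    for band, _subgroup, n, initial, terminal in rows:
        weighted_sum, count = totals.get(band, (0.0, 0))
        totals[band] = (weighted_sum + n * (terminal - initial), count + n)
    return {band: round(s / c, 2) for band, (s, c) in totals.items()}


if __name__ == "__main__":
    for band, change in weighted_mean_change(TABLE_10).items():
        print(f"age {band}: mean change {change:+.2f}")
```

The weighted mean change is about -9.6 points at ages 5 to 11, about -2.85 at ages 12 to 18, and about -0.54 at ages 20 to 50, which matches the statement that IQ declined among the youngest children and changed little in the older groups.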


Admission to an institution has not invariably retarded intellectual 
development as measured by intelligence tests. Thus, we have noted 
above that Clarke and Clarke found gains in IQ among moderately re- 
tarded young adults in an English institution. 


The conflict between the results of Clarke and Clarke and those of 
Sternlicht and Siegel (who found no rise in IQ in older age groups) 
could be due to the social conditions of the institutions relative to the 
sources from which the institutions drew. The conflict could also be due 
to confounding factors that resided in the social and intellectual threshold 
criteria for the selection of subjects for admission, and thereby for inclusion 
in the studies. In Clarke and Clarke’s study, the same confounding factors 
might also explain the apparently beneficial effects of the institution. The 
gains were most marked among those from the most adverse family settings 
(Table 13). The “before and after” control does not control for
spontaneous gains in IQ that might have taken place, in that group in
particular, without a change in situation.

We may conclude that many conventional institutional settings
depress the IQ scores of subnormal children. Removal to more favorable
settings can ameliorate these handicaps.

In Mentally Retarded and Markedly Socially Disadvantaged Children, IQ 
Gains Follow Specially Designed Educational, Social and Medical Inter- 
ventions 

Kirk (1958) made a number of small quasi-experimental intervention 
studies among retarded pre-school children. He selected some of his sub- 
jects from day and residential institutions for the retarded and some from 
families living in a slum. An attempt was made to make the contrast 
groups comparable in age and place of residence, but there was no random 
assignment. The treatment had two elements: 1) a pediatric examination 
of each child aimed to discover and treat physical handicaps to learning 
and 2) through individual attention and close relation with an adult, 
a nursery-school program aimed at instilling work habits, interest in 
materials, and class-room skills. The comparison groups were kept in 
their ordinary setting, whether home or residential. The intervention was
confined to the classroom and only passing observations were made of
home, family, peers and institutional life.


At the end of the experimental period (average one and a half years),
the treated groups showed developmental gains compared with the
contrast groups. As can be seen from Table 11, the advantage of the treated
group had become less marked one year after the experiment stopped.
Gains occurred in children with and without organic brain damage, but
they were greatest for those without damage. The gains in those with
organic lesions might have been attributable in part to the correction of
physical handicaps.

Fig. 4. Mean Stanford-Binet IQ’s of Two Treatment and Two Control Groups of
Preschool Children at Tests During Treatment and Follow-up. Constructed
from Klaus and Gray (1968).

In the experimental study of Klaus and Gray (1968), educational 
treatment of children from families of low social class was supplemented 
by attempts to influence the home environment. The educational expe- 
rience was repeated for two or three summers, and a home visiting pro- 
gram was sustained in the intervening months. Thirty children from the 
poorer social strata in North Carolina were assigned at random to three 
groups: one was treated during three summers; one was treated during 
two summers; and one had no treatment. A non-random control group 
from another town was added. The results of the experiment were com- 
plicated by the fact that random assignment did not achieve compara- 
bility between the small numbers in each group. Nonetheless, an immediate 
if irregular beneficial effect was observed in the two treatment groups. The 
effect seemed to persist throughout the duration of the treatment (Figure 
4). 

Nearly all of the experimental intervention programs to advance
intelligence were aimed at children of pre-school or early school age. At 
later ages, observational cohort studies of children who attended special 
schools for the mentally retarded yielded a little data about the effects 
of special pedagogic programs on subnormal children (Stein, Susser, and 
Lunzer, 1960). Some of the better designed of these studies showed bene- 
fits in scholastic attainment, but they were not addressed to the question 
of IQ changes (Goldstein, Moss, and Jordan, 1965). In the experimental 
study of Clarke and Clarke noted above, however, no differences were 
produced between groups randomly assigned to a special program and 
to a routine program. 

Well-controlled trials of pedagogic intervention are scarce, but in
general, the data available suggest that in children, intervention can
produce a measurable increase in low IQ scores (Hodges and Spicker,
1967; Spicker, Hodges and McCandless, 1966; Sprigle, Van de Riet and
Van de Riet, 1967; Weikart, 1967).


Duration Of Favorable Effects Of Social Experience On Children’s
IQ Depends On Type Of Experience


Advantage In IQ That Groups Of Preschool Children In Short-Term
Pedagogic Programs Achieve Over Comparison Groups Is Not Maintained


Some pedagogic experiments attempted to achieve economy in treatment.
They used short-term treatments at a phase of the child’s development
presumed to be sensitive; their ostensible object was to produce a
favorable turn that would be self-sustained in later years of growth. The
programs of Kirk, of Klaus and Gray, and some of those associated with
Headstart allow this approach to be tested because observations were
taken before and after the intervention and in succeeding years.


In the study of Klaus and Gray the experimental groups also lost 
some but not all of their advantage over the comparison groups in the 
two years of follow-up after the treatment had stopped. This loss of 
advantage again arose not so much from deterioration in IQ as from the 
greater gains made by the untreated groups on their entry to school dur- 
ing the follow-up period. 

A similar conclusion might be reached from the evaluation of year- 
long Headstart programs carried out by Westinghouse Learning Laboratory 
and others (1969). Among Grade I children, those who had participated 
in the programs were compared on a reading-readiness test with those 
who had not. Some advantage was shown for participants. Among Grade 
II children, those who belonged to an earlier cohort of Headstart par- 
ticipants were compared on achievement tests with non-participants. No 
advantage was shown for participants (Table 12). It might be inferred 
that the initial advantage conferred on Grade I children by the Headstart 
program was not maintained through Grade II. The comparison of
different cohorts, with different school experiences in succeeding years,
reduces the strength of these results.


TABLE 12


Test scores of a national cross-sectional sample of children who had
participated in year-long pre-school (Headstart) programs and children who
had not, at one year (Grade I) and two years (Grade II) after the programs
had ended.

                                                               Probability that
                                                               the differences are
                              Participants   Non-participants  due to chance

Grade I (mean scores)*           51.76            48.46             0.04
   n = 432
Grade II (mean scores)+           1.59             1.57             0.50

* Metropolitan Reading Readiness test.
+ Stanford Achievement test.

Adapted from Westinghouse, etc. (1969).


Thus, intervention of a pedagogic type so far provides little evidence
of accelerated development that continues self-sustained after the program
ends. A number of factors, separate or combined, might have caused the
loss of advantage in IQ that followed the end of educational intervention
in these studies. The relative loss might have been due to the nature of
the stimuli that produce such changes. The non-specific stimuli of the
social transition at school entry and subsequent exposure to school may
have affected the control groups in much the same way as the special
programs affected the experimental groups. The loss of advantage might
also have been due to the short duration of the stimulus of the special
programs. A sustained stimulus might be needed to sustain an accelerated
rate of development. All the pedagogic programs analyzed here were set
in unfavorable social circumstances, and their effects can be conceived
as altering favorably the balance of forces impinging on the children.
When the program ended, the unfavorable forces existed unopposed. Such
after-effects of the programs as there may be seem to be insufficient in
themselves to counterbalance an unfavorable array of existing forces.

The studies analyzed under this proposition show that specific
programs have an effect on IQ, but they also show that non-specific everyday
school programs and their accompanying social transitions have an effect,
at least in the range of ages 5 to 6 years. The studies do not tell how
long a program should be sustained to give maximum benefit to those
exposed to it.


Enduring Gains In IQ Follow Physical Removal Of Children From One 
Social Milieu To Another 


Unlike short-term pedagogic programs, certain types of experience 
produced long-term effects on IQ. These effects, either self-sustained 
after a brief intervention or continuing during prolonged exposure, re- 
sulted from experiences that encompassed the total social environment. 

In Skeels and Skodak’s Iowa study of orphanage infants, the single
experimental intervention was a total change in social relations, from 
which many consequences flowed. This study illustrates effects that con- 
tinued self-sustained after the initial intervention. The removal of the 
infants from the impersonal professional mode of care of the children’s 
home to the personal mothering of mentally retarded young women 
stimulated their intellectual development and the improved intellectual 
development made the infants eligible for adoption. The intervention thus 
led to a point of critical decision that set the children on a new life track. 

Lee (1951) elaborated the pioneering cross-sectional studies of
Klineberg (1938) on the effects of migration on IQ. His work provides an
apt illustration of the sustaining effects, on intellectual development, of
various durations of exposure to a particular social experience. He
compared the IQ’s of cohorts of native-born and migrant black children in
Philadelphia as they advanced through school from first to ninth grade.
The IQ’s of the Southern-born migrants improved in direct relation to
the years of exposure to Philadelphia city life. Each cohort of Southern
migrant children had approximately the same initial IQ, regardless of age
and grade at entry, and each made similar gains for each year of exposure
(Table 13). Thus, migrant children who entered school in the highest
grades and had the shortest stay in the city made the least over-all gains;
migrant children who entered the school in the lowest grades and spent
longest in the city attained IQ’s closest to those of Philadelphia-born
children. From the available data, the effect of duration of stay in
Philadelphia can not then be attributed to any particular “treatment” or to
any specific social, economic or educational element of the new way of
life of the migrants.


TABLE 13


Mean IQ's* of Cohorts of Native and Migrant Black Children, by Grade
at First Entry to Philadelphia Schools, in Successive Grades

                                              Grade tested**
Group***                              1A      2B      4B      6B      9A

Philadelphia-born who attended
kindergarten (N = 212)               96.7    95.9     --      --      --

Philadelphia-born who did not
attend kindergarten (N = 424)        92.1    93.4     --      --      --

Southern-born entering Philadelphia
school system in grades:
   1A      (N = 182)                 86.5    89.3     --      --      --
   1A-2B   (N = 109)                         86.7    88.6     --      --
   3A-4B   (N = 199)                                 86.3     --      --
   5A-6B   (N = 221)                                          --      --
   7A-9A   (N = 219)                                                  --

-- Entry not legible in the source scan.
* Philadelphia Tests of Mental and Verbal Ability.
** A refers to first half of school year, B to second half.
*** Out of a total of 1234 migrants, 304 were excluded for missing tests (292)
or attending kindergarten (12). 325 Philadelphia-born children
were excluded for missing one or more tests.
Adapted from: Lee (1951).

One source of possible error was the large number of children who 
were excluded from the study because they were not retested. These ex- 
clusions could detract from the validity of the result obtained if there 
was a systematic bias toward excluding children of low IQ from the 
migrant samples only, or of excluding children of high IQ from the native 
sample only. 

The social background of the children before migration was not re- 
ported. Therefore, no inferences about effects on IQ of the Southern 
environment can be drawn from the similarity of the mean IQ of each 
grade cohort of migrants at entry, and the data can not be compared with 
those studies that show a decline in IQ in older age groups. 

Not all effects produced by exposures of a social type have been
sustained. The experiments of Klaus and Gray contained both a social and
a pedagogic component, but IQ levels of the experimental groups showed
the beginning of a decline at the two-year follow-up. In the elegant
if somewhat manipulative study of Rosenthal and Jacobson (1968), the
“treatment” was to prime teachers with the expectation that certain
randomly selected children designated as “late bloomers” were about to
manifest accelerated development. The research procedures were severely
criticized by Thorndike (1968) and Snow (1969), but if one takes the
results at face value, the children designated as “late bloomers” showed
accelerated intellectual development, in contrast with children for whom
teachers’ expectations were not raised (Figure 5). These experimental
children did not entirely lose their advantage when transferred to new
teachers, but younger boys in particular showed a decline from the peak
IQ attained in the experiment.

Fig. 5. Mean Change in IQ Scores on Flanagan’s ... Elapse of Varying Time
Periods Following ... Different Grades ... For Different ... (caption
truncated in the source scan).


The social stimulus applied by Klaus and Gray and that contrived by 
Rosenthal and Jacobson merely complemented the children’s normal
environment. In studies that have found sustained effects, the stimulus
created a radical change in the social situation of the children by trans- 
ferring them from one social environment to another. Data with which 
to test this generalization are sparse, but those few studies that are avail- 
able provide a strong lead. 


Intervention Effective In Producing IQ Change Has Not Been
Limited To A Well-Defined Critical Period


In the 1930's the biologist Childe made the generalization that an 
embryo during the phase of its maximum growth rate is in a “critical 
period,” that is, at its most sensitive to hurt. Several workers in the field 
of learning invoked the theory of the critical period to choose a time to 
intervene and modify intellectual development. Most of these formulations 
about the critical period were vague and variable, and the choices of 
chronological age by which to define these periods were conflicting. In 
the human brain, growth rate is maximal in the months before and after 
birth. Consequently, the choice of a period for intervention, as repre- 
sented in the literature on the mutability of measured intelligence, was 
not founded on the maximum growth rate of the structure most closely 
linked with intellectual development. The periods chosen span a number 
of phases of the development of physique, intellect and personality, and 
such social transitions as entry to nursery school and primary school. 


There is little evidence that at some ages measured intelligence is 
more mutable or more sensitive to stimulation than it is at others. Such 
evidence requires that cohorts exposed to a treatment at some ages re- 
spond to it, but comparable cohorts exposed to similar treatment at other 
ages do not respond. The comparability of cohorts of different ages is a 
special difficulty. In those studies that best meet this requirement, the 
precision in age intervals required to isolate a critical period was not 
usually found. To take an example of deteriorative change, Sternlicht and
Siegel used the age interval of 5 to 11 years to demonstrate the fall in IQ
in young retarded subjects after their admission to a large mental
deficiency institution, and the intervals 12 to 18 years and 20 to 59 years
to demonstrate the relative stability of IQ in older retarded subjects.
Rosenthal and Jacobson in their California study of normal school children 
presented results more discriminating for age. Their experiment produced 
a sharp rise and decline in IQ among children of grades 1 and 2, whereas 
among children of grades 5 and 7 there was a lesser but more gradual and 
sustained rise (Figure 5). 


Some studies of children at a marked social disadvantage showed, as 
we noted above, that the children’s intellectual handicap, as compared 
with their age peers at better advantage, increased steadily each year. In 
other words, environmental effect was not limited to a critical period. 
IQ gains were inferred even among people old enough to be socially de- 
pendent. In New York City, Weinstock and Bennett (1968) tested nursing 
home patients over 65 years of age. Twenty on the waiting list for ad- 
mission scored lowest. Twenty, who comprised a series of consecutive 
admissions, were tested as newcomers and scored highest. Twenty who 
had resided there for more than one year had intermediate scores (Table 
14). Although the comparability of the groups in this cross-sectional study 
can be questioned, the pattern of scores suggests that the transitional 
event of admission had stimulated intellectual performance and that the 
gain was not totally lost thereafter with habituation to the nursing home. 
The gains could not apparently be attributed to failures to comprehend 
the test, to problems of attention, or to practice effects. In this study, ad- 
mission to the nursing home represented an unplanned intervention. 


TABLE 14


Mean Total Scores On W.A.I.S. In Waiting List,
Newcomer And Oldtimer Groups In A
New York Nursing Home


                           Waiting list   Newcomers   Oldtimers   P*
N                          20             20          20
Mean (total) WAIS score    32.55          38.05       36.35       .001


* Chi squares were computed for differences found ...
Adapted from: Weinstock and Bennett (1968).


There is thus no good evidence so far that there is only
one period of intellectual development sensitive to external intervention.
Furthermore, a logical distinction should be made between factors that by
chance affect the course of intellectual development and a deliberate
treatment to alter that course. A period that is critical because individuals are
especially vulnerable to adverse influences need not coincide with a period
that is critical because individuals are especially responsive to treatment.
Not to make the distinction is to risk a fallacious causal inference.
Conversely, successful treatment of a condition does not invariably have to
remove, negate, or mitigate its causes. On present knowledge, intervention
could be justified throughout the recognized period of mental development
(see Fowler 1962a, 1962b) and possibly later.


IQ Level Of A Majority Limits The Extent Of IQ Gains That 
Can Be Induced In Subgroups Sharing Educational And Social 
Experiences Of The Majority 


This proposition indicates that the majority culture imposes a ceiling 
on the development of social subgroups. In its broadest terms, the propo- 
sition only affirms that the social milieu is a determinant from which no 
group can entirely escape. Treatment programs do not take place in a 
social vacuum. Indeed, their rationale is that treatment can overcome the 
depressing influence on IQ exerted by social, family and educational 
settings. Proposition 3.1 described the limited duration of the effects of
short-term treatment and observed that if an experimental program acts 
in one direction and the influence of the surrounding social milieu in 
another, then cessation of the program leaves the influence of the milieu 
unopposed. 

That a ceiling on the development of social groups is imposed by the 
milieu seems to be the most convincing interpretation of the regularities 
in the IQ scores of the cohorts of black children that Lee followed through 
their school careers in Philadelphia. The development of the Southern 
migrant children also involved their acculturation to the black culture of 
the lower social strata of a Northern city. An index of the acculturation 
process is duration of exposure to city life. Table 13 shows that for each 
year of exposure to the city, the mean IQ of each migrant cohort (defined 
by the grade at which the children entered the Philadelphia schools) rose 
steadily towards that of the native-born of the same age (although
children with kindergarten experience kept ahead). After nine years' exposure
the migrant cohorts reached but never exceeded the level of the native- 
born. The milieu seemed to stimulate and to limit the development of 
the migrants. 

The higher scores of children who had attended kindergarten probably
reflected social selection as much as social experience; these two
elements cannot be separated from the data given.


Katz (1967) interpreted the reports of the U. S. Commission on Civil
Rights (1967) and of Coleman et al. (1966) to mean that the intellectual
performance of children of ethnic groups who comprised a minority in a
school class seemed to cleave to the performance of their classmates. In
these studies, the scholastic performance of children was compared in
schools where white and black students were mixed in various proportions.
Coleman et al. made an extensive cross-sectional observational study of 
educational performance. Their samples were taken throughout the United 
States and included all grades and ethnic groups. With family background 
partialled out, attributes of fellow-students accounted for more variation 
in the achievement of minority group children than did any attributes of 
school facilities, and for slightly more than did attributes of staff (Table 
15). Although knotty problems of interpretation beset these cross-sectional 
data, particularly the selective factors that bring children into the position 
of minority ethnic groups within school classes, the results are consistent 
with the tenor of our proposition. Hence, we consider it likely that the 
higher the intellectual standards that are set by the milieu, the better one 
can expect the intellectual performance of the exposed children to be. 


TABLE 15


Percent of variance in verbal achievement uniquely accounted for by one variable
representing each of: school facilities (A), curriculum (B), teacher quality (C),
teacher attitudes (D), student body quality (E), at grades 12, 9 and 6.


                    Joint              Unique
                    ABCDE    Common    A      B      C      D      E
Grade 12
  Black children    12.43    5.58      .02    1.01   .02    .03    6.77
  White children    ...      .50       .01    0      0      0      2.01
Grade 9
  Black children     8.21    3.99      .01    0      .08    .08    4.05
  White children     1.88    -.06      .02    .08    .06    .09    1.69
Grade 6
  Black children     9.38    2.85      0      .03    0      .01    6.49
  White children     4.37    -.06      .03    0      .05    .09    4.26


Taken from Coleman et al. (1966), table 3.23.1.
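
The entries of Table 15 can be read as a partition of explained variance. On the usual reading of such a table, the variance accounted for jointly by the five variables equals the part they share in common plus the five unique parts. The short Python sketch below checks that reading against the Grade 9 and Grade 6 rows and confirms which unique component is largest. It rests on two assumptions that are not stated in the source: that the additive reading applies, and that decimal points dropped in printing can be restored (for example, 405 read as 4.05). The names ROWS, partition_gap and largest_unique are illustrative only.

```python
# Grade 9 and Grade 6 rows of Table 15, decimal points restored by assumption.
# Each value: (joint ABCDE, common, unique parts for A, B, C, D, E), in percent.
ROWS = {
    ("Grade 9", "Black"): (8.21, 3.99, [0.01, 0.0, 0.08, 0.08, 4.05]),
    ("Grade 9", "White"): (1.88, -0.06, [0.02, 0.08, 0.06, 0.09, 1.69]),
    ("Grade 6", "Black"): (9.38, 2.85, [0.0, 0.03, 0.0, 0.01, 6.49]),
    ("Grade 6", "White"): (4.37, -0.06, [0.03, 0.0, 0.05, 0.09, 4.26]),
}

LABELS = ["A", "B", "C", "D", "E"]


def partition_gap(joint, common, unique):
    """Joint variance minus (common + unique parts); near zero if the row is additive."""
    return joint - (common + sum(unique))


def largest_unique(unique):
    """Label of the variable with the largest unique contribution."""
    return LABELS[unique.index(max(unique))]


for (grade, group), (joint, common, unique) in ROWS.items():
    gap = partition_gap(joint, common, unique)
    print(f"{grade} {group}: additive={abs(gap) < 0.005}, "
          f"largest unique component={largest_unique(unique)}")
```

Under these assumptions each of the four rows is additive to within rounding, and student body quality (E) is the largest unique component in every row, which is the pattern the text describes.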


The More Unfavorable Are The Social And Educational Origins
Of A Group, The Greater Is The Potential That Exists For Change


To compare black children with white children in the United States
is to compare social disadvantage with advantage. Black children show
at a disadvantage on social indices even when social class (as measured 
by occupation or education or income) is held constant. In the study of 
Coleman et al. across the United States, black children performed at lower 
intellectual levels than children of other ethnic groups. At the same time, 
the analysis suggested that intellectual performance of black children was 
handicapped more than that of white children by the environmental con- 
ditions of curriculum, school facilities, the quality and attitudes of teachers 
and the intellectual performance of schoolfellows (Table 15).


Support can be marshalled for the opposite expectation, that in favor- 
able conditions those at the greatest disadvantage will make the greatest 
gains, from three other studies cited above. In Kirk’s experiment in Illinois,
the pre-school children from the most culturally deprived families made
the greatest gains in intellectual performance. In Klaus and Gray’s ex- 
periment in North Carolina, the authors found that the pre-school chil- 
dren in one randomly chosen group came from families rated lower on 
economic and social indices than another such group and that the group 
with the lower rating made the greatest initial advances. In Clarke and 
Clarke’s observational study of adolescents in a mental deficiency institu- 
tion outside London, individuals who came from home backgrounds rated 
as the most adverse made the greatest gains in IQ (Table 3), although 
again their advantage was due to an initial spurt. Possible confounding 
factors in this latter result were discussed above. 

Coleman et al. studied the effectiveness of the summer Headstart 
program of 1965 in various social strata. They reported a consistent ad- 
vantage for participants over non-participants among poor Negro children, 
particularly those from rural areas; the advantage held when kindergarten 
experience and various socio-economic measures were held constant. Similar 
advantages were not found for participants from social groups that were 
better off (Table 16). A parallel finding is presented in an evaluation 
of the year-long Headstart programs. 


TABLE 16


Effects of participation in summer pre-school (Headstart) programs on
Verbal Ability tests of black children and white children from metropolitan
and non-metropolitan areas of two broad regions by kindergarten attendance
(Low Socio-Economic Status children only).


                                  Attended Kindergarten    Did not attend

Black children
  Metropolitan south
  Metropolitan non-south
  Non-metropolitan south
  Non-metropolitan non-south
White children
  Metropolitan south
  Metropolitan non-south
  Non-metropolitan south
  Non-metropolitan non-south

+ Participants scored higher than non-participants.
- Participants scored lower than non-participants.

Taken from Coleman (1966).


Stein and Susser failed to find support for this proposition in their
follow up of mildly retarded persons in early adulthood, when they used
family structure as the sole criterion of family environment. By the
criterion of social position, however, results were consistent with the
proposition that unfavorable origins imply a potential for IQ gains. Their
clinically normal subjects with mild retardation, who had made substantial
gains in IQ by the time they reached early adulthood, came from the
bottom of the social scale; their brain-damaged subjects with mild
retardation, who had not made gains in IQ by early adulthood, came from all
social levels (Table 4).


Fig. 6. Salford Mildly Retarded Population by Age and Sex Compared with Census
Population. Taken from Susser (1968).

In terms of mental development, the fact that groups at a marked
social disadvantage are particularly affected by their circumstances, for
good and for ill, points to the need for creating a good social
environment and for alleviating existing handicaps.


Among Groups At A Marked Social Disadvantage, Changes In 
Mean IQ With Age Correspond With Changes In The Frequency 
Of Mild Mental Retardation 


An outstanding epidemiological characteristic of mild mental retarda- 
tion is its age distribution (Susser, 1968). Mild mental retardation is 
infrequently recognized during early childhood. It becomes frequent at 
pubescence, reaches a peak at adolescence, and declines sharply in early 
adulthood (Figure 6). The low frequency of mild mental retardation at
younger ages has been attributed to the undemanding social roles of early
childhood. As educational demands in particular become more exacting
at pubescence, deficiencies in individual capacities for role performance
become relatively greater. Thus limited intellectual capabilities come to
be the more readily recognized and reclassified as mild retardation.

A second outstanding epidemiological characteristic of mild mental 
retardation is its social distribution. It occurs more frequently in those 
social strata ranked at the bottom of any scale of social class than in those 
found at the top of the scale. The discrepancy is so great as to indicate 
that mental retardation without detectable lesions is virtually specific to 
the lowest social classes and their cultural setting (Stein and Susser, 1969). 
The data assembled in the propositions of this paper reinforce that view, 
in so far as they show that social experience (in addition to labeling and 
recognition) contributes directly to the distributions of mild mental re- 
tardation by age and social class. 

Thus the low frequency of mild mental retardation in children of
tender age could be due not only to their undemanding roles, but to
... environment.

Similarly, the sharp fall in the frequency of recognized mild mental
retardation in early adulthood can be linked with the gains in IQ among 
mildly retarded persons at these ages noted above (Proposition 1.2). 
Intellectual and social recovery follows the mild cultural retardation of 
pubescence in the great proportion of cases. 

More needs to be known about the distribution of IQ losses and 
gains and the size of the contribution of age-changes to the frequency of 
mild retardation, but we conclude that for children at a social disadvantage 
pubescence is a highly vulnerable period. The intellectual handicap of 
such children appears greatest at just those ages when a competitive school 
system and specific socialization to the adult world begin to make their 
greatest demands. Although a degree of recovery from cultural retardation 
occurs without special treatment, it is uncertain how much permanent 
damage remains. 


Among Populations In Defined Geographic Areas, Improved En- 
vironment Has Produced A Rise In IQ Levels And A Decline 
In The Frequency Of Mental Retardation 


Improvement in the intelligence test scores of populations over time 
has been demonstrated in several countries, including Scotland, England 
and the United States (Table 17). These demonstrations depended on 
repeated cross-sectional or prevalence surveys of IQ, i.e., they compared 
the distribution of the condition at one point in time with the distribution 
at another point in time. 


TABLE 17


Change in Intelligence Test Scores in Three British Areas Over Time:
Cross-Sectional Surveys in Each Area at Two Points in Time.


                                                                    Mean Test Scores
Source                   Test                 Age   Year   N        Total    Boys    Girls
Scottish Council for     Moray House          11    1932   87,498   34.46    34.50   ...
Research in Education    group                11    1947   70,805   36.74    35.88   37.62
(1949)
Emmett (1950)            Moray House          11    1935   13,724   42.47    42.12   42.83
                         group                11    1947   12,259   42.28    40.48   44.13
Cattell (1950)           Cattell scale I,     10    1936    3,873   100.49
                         form 4, ...          10    1949    3,832   ...
                         verbal test

Adapted from:
1. Scottish Council for Research in Education (1949).
2. Emmett (1950).
3. Cattell (1950).


In Scotland in 1932 and in 1947, all 11-year-old children took exactly
the same group intelligence tests; subsamples also took the same individual
tests (Scottish Council for Research in Education, 1949). In Leicester,
England, Cattell (1950) tested 10-year-old children in 1936 and in 1949. 
In both studies, the children born later scored higher than their
predecessors. Emmett (1950) examined trends in IQ of 11-year-old English
children between 1935 and 1947 from school records. In Emmett’s data, only
girls of the later generation scored higher than the earlier generation; 
boys scored a point lower. In the United States in a socially depressed 
area of East Tennessee, Wheeler demonstrated a rise in IQ scores between 
two cross-sectional studies separated by a 10-year interval. A sample of
children of ages 6 to 16 years was tested in 1930, and a similar sample was
tested in 1940 (Figure 1). 

The Scottish study is by far the most comprehensive of these studies over
time. It covered a total population of public and private schools and 
is little subject to the criticism that alterations in the education system 
could have altered the social composition of school classes and raised 
IQ levels. In the second survey, in which the higher mean scores were 
found, the special effort made to locate retarded children probably in- 
creased the proportion of the low scoring population included in the survey. 
Migration did not appear to have been of a magnitude or direction to 
have itself produced the IQ changes found. Improved mean IQ in the 
population might be expected to lower the frequency of mild mental re- 
tardation. In the Scottish study there was a substantial decline in the 
frequency of low scorers between the first and second survey. 


In general, the relative improvement for girls was greater than that 
for boys. In Scotland, 17.9% of the 11-year-old girls tested in 1932 were
low scorers as compared with 14% in 1947; by the same definition, in
each survey, 19.6% of the boys tested in 1932 were low scorers and 18.9%
were low in 1947. One explanation related the greater improvement in 
girls to a social setting at home and in school that was specifically more 
favorable for girls in the later period. An alternative explanation related 
the improvement in girls to the progressive decline in the age of the 
menarche over the same time period, for a relationship between sexual 
maturation and intellectual development has been found. Although during 
this century the onset of puberty has come to begin earlier in boys and 
girls, girls are precocious and at 11 years of age the effects are more
apparent among them than among boys. But the sequence of causal
relationships between sexual maturation, intellectual development and the
associated social and family factors has not been clearly established. Thus in
the National Survey of Child Health and Development in Great Britain 
(Douglas, 1964, 1966) “only” children were the most advanced, both 
in sexual maturity and various measures of intellect. But within the cate- 
gory of only children, the association of intellectual performance with 
sexual maturity disappeared. The result suggested that the social factor
of family size was an antecedent explanatory variable that gave rise both
to advanced sexual maturation and to better intellectual performance. 
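
The claim that the relative improvement was greater for girls can be checked directly from the percentages of low scorers quoted above for the two Scottish surveys. The Python sketch below computes the fall in percentage points and the fall relative to the first survey. The names LOW_SCORERS and decline are illustrative, and no figures beyond those in the text are used.

```python
# Percent of 11-year-olds classed as low scorers in the Scottish surveys,
# as quoted in the text: (first survey, 1947 survey).
LOW_SCORERS = {"girls": (17.9, 14.0), "boys": (19.6, 18.9)}


def decline(first, second):
    """Return (fall in percentage points, fall as a fraction of the first figure)."""
    absolute = first - second
    return absolute, absolute / first


for sex, (first, second) in LOW_SCORERS.items():
    absolute, relative = decline(first, second)
    print(f"{sex}: {absolute:.1f} points, {relative:.1%} relative decline")
```

On these figures the proportion of low scorers among girls fell by about a fifth of its first-survey value, while the proportion among boys fell by only a few percent of its first-survey value.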


The initial object of the second Scottish survey was to test the predic- 
tion that the trend of intelligence was downward. Because low scorers 
were concentrated at the bottom of the social scale, and improving stand- 
ards of living produced higher infant survival rates among them, it was 
feared that large families produced by dull people would become so numer- 
ous that they would lower the mean intelligence of the nation. The con- 
trary finding could have been caused either by a change in the genetic 
composition of the cohorts tested, or by a change in their environmental 
experience. There was no evidence that a change in the relative birth 
rates of the social classes or of dull and intelligent people could have 
reduced the proportion of dull children and increased that of bright
children, as would be required by a genetic explanation. A study (Higgins,
Reed and Reed, 1962) showing that the most retarded individuals in one
institution came from less fertile families than the mildly retarded is
sometimes cited as providing such evidence. The data have virtually no
bearing on this question, which requires data on relative fertility
throughout the population (Duncan, 1952, and Bajema, 1966). There was
unassailable evidence, however, that later cohorts experienced better environmental
conditions and that this experience was reflected in their physical attributes 
and development. 


In England, support for the expected effect of improved IQ on the
frequency of mild mental retardation was found in a comparison of the
prevalence surveys of E. O. Lewis (1933) in the 1920's, and in incidence
data gathered by Stein and Susser (1969) in the 1950's. The case-finding
methods were not the same in the two periods, but the comparison suggests
that rates were lower in the later period (Table 18).


TABLE 18


Secular Change in Backwardness at School in England (excluding severe subnormality)
Ascertainment 1925-1927 and 1955-1959 Compared.


                                          IQ upper   N of     N at      Rate/
Area                               Age    limit      cases    risk      1,000      Source
Prevalence in 6 sample areas
  Urban area B                     ...    70           181    10,687    16.9       Lewis, table ...
  6 areas (3 urban + 3 rural)      7-10   70           761    36,692    20.7       Lewis, table ...
  All 6 areas (3 urban + 3 rural)  ...    70         1,608    66,380    24.2       Lewis, table ...
  ...                              ...    70           647    37,743    17.1       Lewis, table 14, p. ...
  ...                              ...    70           961    28,637    33.6       Lewis, table 14, p. ...
Salford, Lancs., schools,          7-10   80 (2)       320    19,500    13.3 (1)   Stein and Susser
  at age 10 (1955-59)


1. Point prevalence (i.e., the number of identified cases compared with the
   population at risk, at one point in time) was estimated for age ...
2. The data to hand dictate the use of IQ 80 as the upper limit in the Salford,
   Lancs., study because the number of children with IQs 70-79 was not known;
   this resulted in a higher prevalence rate than would be computed using IQ 70
   as the upper limit.
Taken from ...
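
The rates in Table 18 are point prevalence rates, that is, identified cases divided by the population at risk and expressed per 1,000. As a hedged illustration, the Python sketch below recomputes the rate for each of the Lewis rows from the counts as they are printed and checks it against the printed rate. The row keys are hypothetical shorthand, because several row labels are not legible in the source; the function name rate_per_1000 is likewise illustrative.

```python
# (cases, population at risk, printed rate per 1,000) for the Lewis rows of Table 18.
# Keys are hypothetical shorthand; they are not the table's own row labels.
LEWIS_ROWS = {
    "urban_area_b": (181, 10_687, 16.9),
    "six_areas_ages_7_10": (761, 36_692, 20.7),
    "all_six_areas": (1_608, 66_380, 24.2),
    "unlabelled_row_1": (647, 37_743, 17.1),
    "unlabelled_row_2": (961, 28_637, 33.6),
}


def rate_per_1000(cases, at_risk):
    """Point prevalence: identified cases per 1,000 persons at risk, to one decimal."""
    return round(1000 * cases / at_risk, 1)


for name, (cases, at_risk, printed) in LEWIS_ROWS.items():
    computed = rate_per_1000(cases, at_risk)
    print(f"{name}: computed {computed}, printed {printed}")
```

Each recomputed rate agrees with the printed one. The two rows whose labels are not legible also add up exactly to the row for all six areas, in cases (647 + 961 = 1,608) and in population at risk (37,743 + 28,637 = 66,380), which suggests they are its two component groups.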

Better intellectual performance in children might be a manifestation 
of an accelerated process of mental development towards the same end 
point. In Sweden, however, better intellectual performance has been ob- 
tained at older ages. Rayner (1964) studied the incidence of rejection for 
deficiency of intellect among men called up for the army during the period 
between 1944 and 1962 and found a declining rate (Table 19). There was 
no indication that the definition of deficiency had been relaxed or that 
standards for acceptance had been lowered, but rather the reverse. Rayner 
suggested that the breakup of geographic isolates had altered the mix 
of inherited traits among native villagers. That the rejection rate for low 
intelligence had declined throughout the region, and to a smaller degree 
all over Sweden, does not fit such a genetic hypothesis, nor does the 
short time-span of these studies allow time for marked genetic change. In 
Sweden, as elsewhere, there is no doubt about the extent of social change 
over the period studied. 


TABLE 19


Percent Of Mental Defectives Among Men Called Up For The Swedish
Army, In An Isolated Village, In The Surrounding Region, And In The
Nation, During Three Consecutive Periods (Numbers Called Up Are In
Parentheses)


Year           Halnas*        Rest of region     Sweden
1944 - 1948    20.2 (124)     6.7 (9,000)        4.0 (173,389)
1949 - 1953    11.4 (123)     4.3 (9,854)        3.1 (242,795)
1954 - 1962     5.0 (159)     3.7 (20,662)       (No information)


Adapted from: Rayner (1964).

* P. Differences between 1944-48 and 1949-53 = < .01.
     Differences between 1949-53 and 1954-62 = < .01.
     Differences between 1944-48 and 1954-62 = < .01.

Thus, increments in the IQ of populations accompanied social change 
in Scotland, England, the United States and in Sweden; these increments 
in IQ evidently affected the frequency of mild mental retardation. Since
social change included improvements in nutrition, employment, social
mobility, and education as well as a reduction in mortality and morbidity
rates, the improvement in measured intellectual performance cannot be
related to any specific element of the social change.


Conclusion


Analysis of existing individual and population data about IQ change 
at the lower end of the IQ scale has led us to support a number of specific 
propositions, some more and some less firmly. In general, we conclude 
that there is good evidence in individuals and in populations of systematic 
IQ change through time. Changes in IQ with age and with particular 
experiences suggest that exposure of children to certain social environments 
can have an appreciable effect on their mental development. Both depress- 
ing effects of a poor social environment and stimulating effects of a good 
social environment are most evident in groups at the greatest social dis- 
advantage. The greatest deterioration and the greatest and most sustained 
improvements have been produced by total exposure to a new residential 
environment. 


The mutability of IQ in individuals has found a parallel in appreciable 
improvements in the IQ of populations as the social environment improved. 
The distribution of mild mental retardation varies concomitantly with 
population IQ levels, and evidence suggests that the frequency of mild 
mental retardation is declining. Our conclusion is that improvement in the 
social environment of groups at a marked social disadvantage can bring 
about a substantial improvement in IQ levels and a decline in the fre- 
quency of mild mental retardation. Only radical environmental change 
can be expected to bring about rapid improvement. It seems likely that 
the greatest advantage will come from a serious attack on poverty and 
its concomitants in unemployment, deteriorated housing, physical
environment, and poor and inappropriate schooling.


4: DISADVANTAGED RURAL YOUTH


EVERETT D. EDINGTON*
New Mexico State University 


Previous research reviewers have tended to overlook 
the rural student and the characteristics which may be unique to him and 
his situation. In this chapter, I attempt to identify those characteristics 
which, because they are unique, tend to cause the student in rural areas 
to become disadvantaged. During the last few years, a considerable amount 
of material was written about rural America, but little of it was based 
upon research. Although adequate research design is lacking in many of 
the studies, they do tend to give the best picture available of the rural 
student. Much of the material cited in this article is not available in pub- 
lished journals; it came from fugitive documents of limited circulation 
which fortunately are available through the ERIC system. 

A number of writers pointed out that rurality by its very nature may 
have caused pupils to be disadvantaged. Ackerson (1967) stated at the 
National Outlook Conference on Rural Youth that the incentive
to remain in high school or in college was evidently not as great
in rural America, as shown by the high dropout rate, and in all too many 
cases, the educational and vocational opportunities offered to rural young 
people were quite limited. Lamanna and Samora (1967) obtained similar 
findings in a study of Mexican American youth. They found that rural 
or urban residence was strongly related to educational status. Urban resi- 
dents were almost always better educated than rural residents, regardless 
of sex, age, maturity, race, or parentage. 

It is difficult to make broad generalizations, other than those previously
mentioned, concerning disadvantaged rural students. Such groups as the
mountain folk of the Appalachian region, the Southern rural Negroes, the 
American Indians, or the Spanish-speaking youth of the Southwest have 
special problems. In addition, characteristics are often quite different 
for persons within the major groupings. Berman (1965) noted that it was 
invalid to consider all Indian students, no matter which tribal affiliation 
they maintained, as “just Indians” and to prepare an over-all program 
which purported to be adapted to the “Indian population.” Similarly,
it is not acceptable to lump all Spanish-speaking students together under
the term “Mexican” or some other term, and to consider all
Spanish-speaking students as having identical learning problems amenable to
identical educational techniques.


* Drs. J. Clark Davis, University of Nevada, and John E. Codwell, Southern Association
of Colleges and Schools, Atlanta, Ga., served as consultants to Dr. Edington on the
preparation of this chapter.

The problems experienced by the rural disadvantaged student are not 
limited to geographical location. Edward B. Breathitt (1967), former 
governor of Kentucky, emphasized this fact in his statement that the 
conditions of the rural disadvantaged were not confined to any one section 
of the United States. They exist in Appalachia and Alaska, in the Missis- 
sippi Delta and the Midwest, in New England and California. Such 
conditions are widespread enough to be a national problem. 

The major characteristics of the disadvantaged rural student covered 
in this review are socioeconomic status, aspirations, attitudes, educational 
achievement, educational retention, curriculum, and cultural and social 
status. 


Socioeconomic Status 


Poverty is a widespread condition among residents of rural areas. 
Mercure (1967) revealed that one-half of all rural families in northern 
New Mexico, the Mississippi Delta, the Ozarks, and Appalachia had 
incomes below $2,000. Douglas (1967) found in his work related to mental 
health of rural youth that one-third of all persons living on farms and 
one-fourth of the rural non-farm population were families with cash 
incomes below established poverty levels. Udall (1967) indicated in his 
report at the National Outlook Conference that one-third of the rural 
population accounted for one-half of the population designated as living 
in poverty. Udall also reported that the median family income for rural 
Negroes in fourteen Southern states in 1966 was less than $1,500, and 
in the Southwest over one-third of the Spanish-surname families lived 
below the poverty level. 


Jenkins (1963) and Taylor and Jones (1963) reported that rural 
income per capita did not match urban income per capita, and that as a 
result rural residents were disadvantaged in terms of the larger society. 
Jenkins further stated that as a result of this and other factors, there were 
many rural children who had extremely limited and even impoverished 
social contacts limiting opportunities to learn, which resulted in an in- 
creased incidence of cultural and mental retardation in the poorer rural 
areas. Baughman and Dahlstrom (1968), in their study of a typical rural 
Southern community, reported that people were moving away from
dependence on the land, although farming remains important. In rural
America, non-farm jobs have not developed rapidly enough to meet the 
needs of the people, and consequently the youth of that area must seek 
their future elsewhere. 

On all socioeconomic levels, children may be hampered by
characteristics resulting, directly or indirectly, from their parents’ situations.
Thurston, Feldhusen and Benning (1964), in a study of factors affecting 
behavior of rural and urban youth, found that parents with low occupa- 
tional and educational levels were likely to have children with excessively 
aggressive behavior. Non-skilled, low-paying jobs, with their consequent 
fatigue, boredom, and lack of personal reward, were shown to exacerbate 
existing personality problems within the home, and, hence, to directly 
influence the home atmosphere. Thurston concluded on this basis that 
living in rural areas with low income seems to be particularly conducive 
to the development of “disadvantagement.” 

McMillion (1966) discovered that rural students from different socio- 
economic levels placed different connotative values on selected words and 
phrases. For example, the word leadership was valued more highly by the
socioeconomically disadvantaged group of pupils. The word cooperation was
valued more highly by the middle socioeconomic group of pupils than by 
the highest socioeconomic group of pupils. 

Bass and Burger (1967) pointed out that the American Indian is 
the most disadvantaged rural group. In comparison to the general popula- 
tion, their income was only two-ninths as much; their unemployment rate 
was almost ten times greater; their life expectancy was seven years less; 

again as many of their infants died; their school dropout rate was 
almost double that of the general population, and they had less than 
half the years of schooling. 

When they studied the relationship between family variables and
children’s intellectual performance, Baughman and Dahlstrom (1968)
discovered that the relationship was closer among the white pupils than among the
Negro pupils. Generally, however, the study showed the intellectual pro- 
ficiency of a child was positively correlated with the socioeconomic status 
of his family. 

The studies reviewed indicate there is a definite relation between 
socioeconomic levels and educational progress in rural America. This same 
relationship exists for the general population of the country and is found 
in urban and suburban areas. In rural, urban and suburban U.S.A. alike, as
economic status rises, educational achievement levels rise.


Aspirations 


The research reviewed indicates that there are differences in the
occupational and educational aspirations of rural youth in comparison to
aspirations of other youth and that aspirations may differ among different
types of rural youth.

Ackerson (1967) reported that only about one-tenth of rural young 
people would be able to remain successfully in farm life, yet the other
nine-tenths were not prepared to find other types of employment in the 
environment of an urban community. Sewell (1963) confirmed the find- 
ings of previous educational planning studies which indicated that occu- 
pational choices of youth were related to residence. 

Rural youth on the whole receive less preparation for successful entry 
into the world of work and have a much smaller range of occupational 
aspirations. Haller, Burchinal and Taves (1963) compared rural to urban 
youth; they discovered that the college and occupational aspirations of 
rural youth were lower, that they had more trouble getting a permanent 
job, and that their jobs were not as skilled or highly paid as those of 
non-rural youth. Taylor and Jones (1963) found that in the rural en- 
vironment the range of occupational types was limited and that there 
were few, if any, white collar jobs represented. The youth from rural areas
may not develop attitudes, desire, or motivation to achieve occupational 
success in white collar jobs. 

Taylor and Jones (1963) further pointed out that in low-income 
areas, students’ peer group experiences are homogeneous in terms of social 
class; thus, these experiences minimize the students’ introduction to dif- 
ferent values and traditions. Therefore, behavior of rural youths exhibits 
greater conformity to the cultural values of their own subcultural reference 
group. This conformity is reflected in the educational and occupational 
aspirations of low-income rural youth. 

There is some indication that rural students from the various ethnic 
minority groups have lower occupational and educational aspirations than 
other rural youth. Drabick (1963) in his study of the aspirations of 
Negro and white students of vocational agriculture in North Carolina 
found that the Negro, male, senior agriculture student did not desire or 
expect to enter occupations with as great prestige as did white students. 
The same relative relationships existed for the educational plans of the 
two groups. Crawford, Peterson and Wurr (1967) found that the Indian 
student had lower aspirations than other students. Henderson (1966) re- 
ported that nearly 50% of unemployed Mexican American adults were 
not looking for work. 

Socioeconomic status of rural youth plays an important part in as- 
pirations. Taylor and Jones (1963) reported that when emphasis on formal 
education was lacking, as in low-income farm families, the youth involved
did not perceive education as a dominant value in American culture and
consequently were not motivated to obtain education. Sperry (1965) found 
a relationship between standards of living and interests of rural youth. 
Youth from high and middle economic status group backgrounds displayed
more scientific and musical interest than youth from lower standard-of- 
living backgrounds. Sperry felt that scientific interest was explainable in 
that certain cultural advantages, generally more prevalent among high
and middle status groups, were known to stimulate an interest in dis- 
covering new facts and solving problems. Likewise, there might be greater 
emphasis and resources expended on musical interests among families 
with higher standards of living. Sperry (1965) and Taylor and Jones 
(1963) indicated that rural youth from a higher socioeconomic level had 
higher educational aspirations and took greater advantage of educational 
opportunities than rural youth from lower socioeconomic levels. 

Rural Negro youth were found by Ohlendorf and Kuvlesky (1967) 
to be more oriented toward attaining higher levels of education than rural 
white youth. Negro boys and girls had higher educational expectations 
than white boys and girls had. Ohlendorf and Kuvlesky also discovered 
that much larger proportions of the Negroes desired and expected to do 
graduate work, while larger proportions of the whites desired and expected 
to terminate their education after graduating from high school. These 
findings are particularly interesting when compared to the 1963 results 
reported by Drabick in his study in North Carolina, which showed lower 
educational aspirations and expectations among Negro students than 
among white youth. The explanation for the contradiction is not certain, 
but it may be due to more realistic aspirations among the white youth or 
to the differences in the two populations studied, or to significant social 
changes during the years which elapsed between the Drabick study and 
the work done by Ohlendorf and Kuvlesky. 

There does not seem to be complete agreement on educational aspira- 
tions and practices of farm and non-farm youth. Sperry (1965) and 
Drabick (1963) reported that non-farm rural youth placed higher values 
on education and more of them attended college than did farm youth or 
those taking vocational agriculture classes in high school. Slocum (1966) 
did not find this true in his research in the State of Washington. He found 
that more farm boys (80%) than non-farm boys (72%) aspired to attend
college. The proportions of farm and non-farm girls with college aspirations
were equal. The differences in findings may be due to the higher socioeconomic
level of the farmers in the Northwest section of the United
States since Slocum also found that the educational aspirations and ex- 
pectations of students tended to be positively related to the economic and 
social status of parents. 

Rural schools apparently have done very little to help students 
change these aspiration patterns. Severinsen (1967) indicated that one 
of the problems of rural youth stemmed from lack of adequate occupa- 
tional information. This study concluded that significant improvements 
in vocational knowledge among high school students resulted when in- 
creased informational services were provided. Lindstrom (1965) found 
that rural schools gave no assistance to students who were migrating to 
the cities to work. He concluded that it was a mistake for youth just finishing
high school, especially the younger ones and females, to migrate
to the city to seek jobs. Rather, it would be better for these young people 
to remain in the community to get some job experience related to the 
kinds of jobs available in the city or to get advanced training of the type 
demanded by these occupations. 


Attitudes 


Disadvantaged rural children bring certain attitudes to school which 
seem to be associated with their home life and economic status. Crawford 
(1967) said in his discussion of the Chippewa Indian that true poverty 
involved something much more significant to children than just low in- 
come. Poverty involved certain prevalent attitudes which affected the 
children as they grew up. One common attitude which the rural poor 
have is the feeling that they are trapped and that there are no promising 
choices open to them in solving their problems. This attitude carries over 
into school activities. Palomares and Cummins (1968) pointed out that 
the Mexican American population in a small border town of Southern 
California tended to see itself in a less favorable way than the normative 
population. The self-concept of Mexican Americans seemed permeated 
with feelings of inadequacy and low self-esteem, both at home and at 
school. A weakness of this study, pointed out by the authors, was that 
the tests used the norms as a control population rather than comparing 
the attitudes of the Mexican Americans in the community with Anglos 
or others in the same area. Low self-esteem may well have been a charac- 
teristic true of the entire community rather than just of the Mexican 
Americans.

In a study of achievement among Mexican Americans, large numbers 
of whom are rural residents, Mayeske (1967) examined three aspects of 
student maturation and attitude in relation to achievement: (1) students’
interest in school and persistence of reading outside school; (2) students’ 
self-concept, especially with regard to learning and success in school; and 
(3) students’ sense of control of the environment. Mayeske found that 
the attitudinal item most highly related to achievement test scores at all 
grade levels was students’ belief in their ability to control or influence 
their environment. The differences in achievement associated with the 
belief in one’s ability to control his environment remained even after dif- 
ferences in home background were taken into account. Coleman et al. 
(1966) reported similar findings for a more broadly representative popula- 
tion. Mayeske suggested that for children who have experienced an un- 
responsive environment, a change in their ability to influence their 
environment might lead to increased achievement. 


Sperry (1965) pointed out that there were sex differences in the 
educational attitudes of rural children. Girls’ attitudes toward an education
were more favorable and were more similar to those their parents
hoped they held than were boys’ attitudes. Sperry also reported that rural 
youth received more “strong urging” to continue their education from 
their mothers than from their fathers. 

Educators and lay community persons often have different attitudes 
toward rural students from different ethnic backgrounds. Anderson and 
Safar (1969) reported a sharp disparity between school board members’ 
and administrators’ perceptions of the adequacy of existing school programs 
for Anglos, Spanish-Americans, and Indians. School board members in- 
terviewed were quite satisfied with existing programs and felt the programs 
were equal for all the groups of children. School administrators felt that 
Spanish-American and Indian students were not encouraged as much as 
their Anglo classmates. 


Educational Achievement 


All groups of disadvantaged rural students are characterized by poor 
educational achievement. The United States Department of Agriculture 
(USDA, 1967) reported that about 19% of the rural youth had fallen 
behind at least one year and that only 12% of urban youth were that 
educationally retarded. 

Baughman and Dahlstrom (1968), in their study of a rural Southern
community, found that only white girls consistently measured up to na- 
tional norms on academic achievement scores. The younger white boys 
compared favorably with the girls, but beyond age eleven their scores 
dropped below both the national norms and white girls’ achievement 
scores. In this study the Negro boys made consistently poor showings; 
this was apparent at even the younger ages. Negro girls achieved below 
norm levels except in spelling, but achieved significantly higher than the 
boys in spelling and mathematics. Silvanoli and Zurkowski (1968) found
that young disadvantaged Arizona Indian children did well in spelling. 
A USDA report (1967) indicated that in the five Southwestern states, 16 
and 17 year olds with Spanish surnames were far below the national norm 
of educational achievement. This was especially true for rural Spanish 
surname youth. 

The most deprived and sometimes least visible member of American
rural society is the American Indian. Bass and Burger (1967) reported 
that a comparison between American Indian and Anglo students, con- 
trolled for geographic isolation factors, showed the schooling gap to be 
attributable to cultural differences rather than ruralism. 

A number of studies have shown that the Indian student is nearly 
equal to the Anglo at the pre-school and primary levels, but as he progresses
through grade levels he falls behind. The Ohannessian (1967) and
Bass and Burger (1967) studies are good examples. In each it was found
that as Indian students went up the school ladder, their achievement 
seemed to fall progressively behind the school norms. Bass and Burger 
found that the situation worsened as the Indian child progressed from the 
sixth to the twelfth grade. 


Palomares and Cummins (1968) found the same to be true with the 
small town Mexican-American population, which was characterized by a 
progressive drop in achievement throughout the grades. Mexican Americans 
were normal in achievement at first and second grade, but one grade be- 
hind by sixth grade. The investigators found the same situation in relation 
to perceptual-motor development of the Mexican American children. This 
progressive deficit in perceptual-motor development was attributed to both 
home and school environment. Palomares and Cummins found an almost 
identical situation in studies conducted at Wasco and San Ysidro, California. 


Statistically significant differences in IQ scores for rural Indian, 
Mexican American, and Anglo students were found by Anderson (1969). 
In a study in rural New Mexico he found that 55% of the Anglo students 
had high level IQ scores, 18% had median level scores, and 27% had low level
scores. For the Spanish American pupils the high level, median level, and
low level percentages were 33, 26 and 41 respectively; for the Indian pupils,
the percentages of students whose IQ scores fell into each category were
18, 9, and 73 respectively. The same type of distribution was found for
achievement scores among the three groups at the elementary and high
school levels.


Baughman and Dahlstrom (1968) found in their study of a Southern 
rural community that white girls and boys had the highest ability levels, 
but white girls were highest in achievement scores. Negro girls scored 
about one standard deviation below the national norms on both ability 
and achievement scores. The Negro boys were equal to the Negro girls 
on ability scores at lower ages but were lower as they progressed in years. 


It should be remembered, however, that it is very difficult to measure 
either IQ or achievement accurately with tests that are culturally biased. 
Wax and Wax (1964), in working with Indian children, found that pro- 
ficiency in English was essential for scholastic or academic achievement. 
For this and other reasons, existing methods of measuring achievement and 
academic ability are biased against the child whose first language is not 
English. Henderson (1966) further substantiated this finding when work- 
ing with Spanish-speaking students. It seemed that lack of training and 
cee were seen as barriers to advancement more often than was ethnic 
identity. 


Language difficulty is also a problem for English-speaking disadvantaged
rural people who use a non-standard form of English as their first
language. Skinner (1967) reported that much of the illiteracy among the 
Appalachian people was really the result of failure to supply the children 
with means of learning to use standard English effectively. A language 
system is imposed upon them which is totally alien to their experiences. 
Alien reading and writing codes are incorporated into it. Skinner further 
stated that when pupils could not meet the demands to learn the language 
system, they were labeled as problem leaders and illiterates. He said the 
children were not illiterates, but they appeared to be so when measured 
according to the middle-class language system. 


Educational Retention 


The research literature on retention rates for individual groups of 
rural youth is sparse and not very clear. Available figures make it evident, 
however, that the dropout rate of rural students is a serious problem. 
The USDA report (1967) indicated that although the dropout was a 
nationwide liability, failure occurred more often in the South than the 
North and West, and more often in rural than urban areas. Lamanna 
and Samora (1967) reported that urban residents were much more likely 
to stay in school than rural non-farm or rural farm residents. 


The dropout rate among American Indians is extremely high. Craw- 
ford (1967) reported that in secondary school the Indian pupil typically 
begins to show evidence of scholastic and personal problems. His atten- 
dance is often erratic. According to Bryde (1967), the national dropout 
rate for Indian students from the eighth to the twelfth grade was 60%. 
A dropout rate of that size indicates not only scholastically but also socially 
maladaptive behavior by the majority of Indian students. Wax and Wax 
(1964) felt that socioeconomic problems had much to do with the high 
dropout rate of Indian students. These investigators found a higher fre- 
quency of dropouts among high-school Indian students when the father 
was irregularly employed than in those families in which the father had 
steady employment. The study also indicated that a persistently increas- 
ing difficulty with the English language caused a lag in comprehension 
and eventually resulted in termination of the student’s education. 


Although the dropout rate is high among other groups of rural youth, 
it is a particularly serious problem for children of migrant workers. 
Soderstrom (1967) in a study of migrants in Idaho found that they had 
a dropout rate four times greater than the Idaho statewide average. Soder- 
strom indicated that the characteristics of migrants which might cause 
the dropout problem were limited cultural environment, high mobility, 
and language difficulties. 


Curriculum


It was indicated in many of the studies reviewed that the curriculum 
was not adequate to prepare rural students, especially those from disad- 
vantaged homes, for higher education or employment. Mercure (1967) 
reported that most rural schools did not have the resources or creativity 
to develop programs designed to enable rural minority youth to relate to 
the broader United States environment. He felt that consolidated rural 
school systems should work out more appropriate programs and curriculum 
for these students. 

Jenkins (1963), Lindstrom (1967) and Ohlendorf and Kuvlesky 
(1967) felt that vocational education programs for youth should be up- 
graded. Jenkins noted that a major need in dealing with rebellious rural 
youth was to give them a stake in the social order by helping them acquire 
vocational skills. He reported that their schooling was too much limited 
to the academic. The vocational training available to rural youth was too 
often limited to training in farming, which could not meet the need of the 
majority of rural youth who must move into industry. Lindstrom found 
that most of the rural youth migrating to the city had no specific training 
in high school to prepare them for those jobs in the city that were likely 
to be offered to them. 

Ohlendorf and Kuvlesky (1967) reported that large numbers of rural 
youth who reside in low-income areas, especially Negroes, want and ex- 
pect to attain higher levels of education. If rural students are to be able 
to meet these expectations, then more adequate curriculum and facilities 
must be provided. Otherwise, opportunities for these youth to participate
fully in society will continue to be limited by their disadvantaged educational
status.


Cultural and Social Status 


There are a number of major cultures represented by disadvantaged 
rural youth. Some of the most distinctive and well-known include the 
rural Negro, the mountain white, the American Indian, and the rural 
Mexican American. Each group tends to limit the experiences of the child 
primarily to the culture of the particular group. Henderson (1966) reported
that the rural Mexican American youth tended to associate only with 
persons within his own group, thus further limiting his cultural expe- 
riences. Jenkins (1963) pointed out that the limited range of contacts 
available to the rural child had definite effects. That child’s opportunity 
for learning is likely to be more restricted than that of either the advantaged
rural youth or the urban youth. The rebellious rural youth cannot melt
into the “crowd” available to urban youth. In the continuity of contacts 
which is characteristic of his life, the verbal assurance of the rural youth 
becomes less important and his performance becomes more important. It
is not easy for him to substitute assurance for performance. 

Weller (1965) reported apparently recent attitudinal changes in the 
mountain people of the Appalachian region; they seemed to be recogniz- 
ing the importance of their children’s obtaining an education. They still 
feared, however, that education would separate the children from their 
families and destroy the common reference group. Weller indicated that 
in adult-centered mountain families, separation between adults and chil- 
dren began about the time children entered school and from that time 
increased rapidly. The mountain white people tended to resist help from 
organizations or individuals other than relatives. Crow, Murray, and 
Smythe (1966) found that because of this resistance, a great many people
were not very interested in schools or schooling. This adult commitment 
to independence was easily adopted by the children who often mimicked 
it in their resistance to the authority of the teacher or the policeman. 

Another characteristic which Weller found in the culture of the moun- 
tain youth was their inability to concentrate for long periods of time on 
a particular subject. This inability combined with the lack of any home 
emphasis on the value of learning that could not be applied immediately 
hindered the mountain child in his education. 

The Mexican American student in the Southwest is another example 
of youth torn between two cultures. A great many of these young people 
are becoming Americanized and integrated into the mainstream. Forbes 
(1967) reported that in many rural areas of the Southwest, most Mexican 
American adults could be described as belonging primarily to the culture 
of northern Mexico. The Spanish language was still favored over English 
in the homes. Often the young Mexican American student entering a 
completely “Anglo type” school is torn between the culture of his parents 
and the middle-class orientation of the classroom. Mayeske (1967) stated 
that achievement was highest for Mexican American students when English 
was spoken in the home. The use of a language other than English de- 
tracted from the achievement of the youth. Mercure (1967) also reported 
that few students from small Spanish American villages participated in 
extracurricular activities at consolidated high schools. 

Students with the greatest cultural differences to overcome are Amer- 
ican Indians. Gaarder (1967) indicated that more than half of the Indian 
children he studied used an Indian language. This lack of familiarity 
with English made it very difficult for them early in school to become a 
part of middle-class school culture. In addition, the problem of improving 
education for the Indian student is complicated not only by great cul- 
tural differences between him and the dominant society, but also by ex- 
treme cultural differences among the Indians themselves. There are 
several Indian subcultures; Ohannessian (1967) reported some 13 large
and extensive language families among the American Indians. These
language subdivisions tend to differentiate the various groups. Ohannessian 
also noted that some Indians appeared to be actively striving for assimila- 
tion and did not regard the majority culture as one imposed upon them. 
Others actively or passively rejected it. Aurbach (1968) reported that 
more than 50% of all Indian children dropped out of schools in the late 
1950’s, and that among the major reasons were the cultural differences 
in educational expectations between Indians and other groups. Bryde 
(1967) discovered that when comparing personality variables among white 
and Indian groups, 26 of 28 personality variables were significantly dif- 
ferent. On each of the measures, the total Indian group revealed greater 
personality disruption and poorer adjustment. Bass and Burger (1967) 
reported that the Indian student finds himself at a great disadvantage be- 
cause of cultural differences. The authors indicated that the mere fact 
of conflict between the language at home and the language at school 
caused a high failure and dropout rate. 


Conclusions 


A review of the available research relevant to the characteristics of 
disadvantaged rural students shows them to be affected in seven general 
areas. The low socioeconomic status of large numbers of noncorporate- 
farm rural families is a characteristic of prime importance, particularly 
in view of the relationship between economic status and school achieve- 
ment for rural as well as urban children. In addition, the educational 
and occupational aspirations of rural students appear to be negatively 
affected by their low economic status and possibly further depressed by 
factors related to geographic isolation. Many rural young people who 
will not be able to make a satisfactory living by farming do not aspire to any 
higher skilled urban occupations nor to the educational level which would 
prepare them for such work. Possibly related to socioeconomic status are 
other attitudes found among rural children which may further hinder 
their progress: low self-esteem, feelings of helplessness in the face of seem- 
ingly unconquerable environmental handicaps, and impoverished con- 
fidence in the value and importance of education as an answer to their 
problems. All of those attitudes understandably may contribute to the 
child’s failure to benefit from his schooling. 

For the rural child, these three characteristics—socioeconomic status, 
low level of aspiration, and attitudes non-supportive of educational pro- 
gress—are linked with a fourth, educational achievement, to form part 
of a cycle of cause and effect the mechanisms of which available research 
does not yet permit us to specify. Disadvantaged rural students, like their 
urban and suburban counterparts, are characterized by achievement levels 
below national norms. Moreover, the mobility of rural and urban disadvantaged
populations makes it difficult to determine whether rural student
achievement levels are more seriously retarded than urban disadvantaged 
student levels. Accompanying these characteristics is a pattern of slightly 
higher dropout rates, which indicates that educational retention is a more 
serious problem in rural than in urban areas. 

Studies which survey these characteristics of rural youth also reveal 
that curricula in rural schools are frequently inadequate for and irrele- 
vant to the needs of these students. Several writers noted that curricula 
should be more meaningfully related to the financial and occupational 
realities of the students’ lives. Finally, available research indicates a wide 
range of cultural and ethnic groups among disadvantaged rural youth. 
Children from each distinctive group tend to be limited in the breadth of 
their cultural experiences, and thus find it difficult to adapt to educa- 
tional environments which tend to follow mores and values drawn from 
the dominant culture and broader frames of cultural reference. 

Perhaps the two primary conditions vital to any consideration of dis- 
advantaged rural youth are isolation and poverty. The former is of special 
concern since it is perhaps the one characteristic most peculiar to the 
noncorporate-farm rural child, and one which may make the effect of 
other disadvantages more severe. Not only does geographic isolation help 
to confine the child’s cultural experience to his own group, but also this 
relative isolation may well make it more difficult for the school to capitalize 
on characteristics which could be turned to the pupil’s advantage in a 
setting where richer and more varied educational resources were available. 
Poverty, likewise, is a rural condition of primary importance. It is endemic 
to a large segment of the rural population not directly involved in cor- 
porate farming. Although poverty is not incompatible with high level 
academic achievement, research consistently shows a high degree of asso- 
ciation between poverty and low level educational progress. 

The survey of available material indicates that studies of rural chil- 
dren are of about the same level of research as those studies directed at 
disadvantaged children in general. Emphasis is placed on negative charac- 
teristics or deficits as compared to some assumed norm for the total popu- 
lation. Although some subgroups have been identified for study, the 
tendency in this research is to treat rural youth as if they were a mean- 
ingful and integral group for study. Since they tend to be removed from 
the proximity of major research centers, this population has not been the 
subject of intensive longitudinal and developmental process investigations. 
Qualitative studies of function and process are absent. Status studies 
have dominated. Such studies are not helpful in terms of the education 
of such children. We know they are poor; we know they are disadvan- 
taged. We know that they are deficient in some of the areas where more 
privileged students are strong. What we need is examination of critical 
issues having to do with fundamental relationships between the functional
characteristics of rural disadvantaged children and educational development.
Too often the analysis of educational disadvantagement tends to be
approached quantitatively. This work contributes to classification and 
serves some administrative functions, but before we can develop really 
effective correctional, compensatory and developmental programs which
circumvent some of the handicaps, which provide alternative routes to 
learning, or which build upon special characteristics, we need more de- 
tailed appraisal research with a greater qualitative emphasis. 


The studies examined in this chapter also serve to emphasize another 
weakness in current research on disadvantaged children. Any large-scale 
quantitative approach to the study of the status of conglomerate groups 
leads to excessive generalizations. Important variations within subgroups 
are often lost in this research. The functional relationships between and 
among status and process variables are seldom studied. For purposes of 
effective educational improvement and better understanding of the
developmental and educational issues involved, there is a crying need for more
concern with individual and subgroup differences in function, with
developmental and learning environments, with differential facilitating and
interfering processes, and with the relationships among such variables.


Finally, it is important to evaluate the tendency to view these prob- 
lems in isolation from the main currents of educational research and 
development. The movement of sub-populations in the United States 
today is such that rural areas feed their problems and special characteristics
into urban and suburban populations. Although the problems of rural
disadvantaged children, as this survey has shown, are not unlike those 
of other youngsters, rurality does impose certain conditions which ex- 
acerbate educational problems. Future research relating to disadvantaged 
rural students must be coordinated with other major educational research 
programs in the nation. Educators can no longer afford to study each 
segment of the society in isolation from any other. The problems and 
their solutions are overlapping and interrelated. 
5: SOCIAL CLASS AND THE
SOCIALIZATION PROCESS 


EDWARD ZIGLER* 
Yale University 


The literature on social class differences 
in socialization within our own society has much in common with the 
cross-cultural literature dealing with societies throughout the world. Much 
of it is based upon descriptive accounts of widely varying adequacy. 
These reports have engendered stereotyped views of the behavior of each 
class and have fostered a modal-man approach to social class member- 
ship. Thumbnail sketches of what a lower-class or middle-class person 
“is like” are a familiar part of this literature; see, for example, Cavan’s
(1964) sketch of each of six classes from upper-upper to lower-lower. 


The word-pictures employed to describe a social class personality inevitably
tend to emphasize the homogeneity of behavior within a class and the 
heterogeneity across classes. Because it is irrelevant to the main purpose 
of such writing, little attention is ordinarily given to discussions of vari- 
ability within a class or of similarities across classes. In discussing the 
class personality profiles that have been drawn, Clausen and Williams 
(1963) pointed out that, although they are largely unsubstantiated, these 
profiles resulted “in some remarkably tenacious and persistent stereo- 
types.” 

The readiness of many writers to treat social class differences in 
this way is somewhat surprising in light of the very vagueness of the 
social class concept. A modal-man approach and emphasis on inter-group 
variation are plausible when applied to discrete societies that have clearly 
defined membership and are distinguishable from other groups in many 
obvious ways. Initially, they are much less plausible when applied to 
subgroups of one society having uncertain membership with much inter- 
action and mobility among them and sharing a common core of history 
and values. 

Conceptualizing social classes as discrete groups, each with its own 
subculture, has frequently been objected to at a theoretical level (see 
Brown, 1965, chapter 3; Cavan, 1964). Objections have also been based 


*Dr. Boyd McCandless, Emory University, served as a consultant to Dr. Zigler on the
preparation of this chapter. Support for the preparation of this manuscript was supplied
in part by U.S.O.E. contract #6-10-240 with Yeshiva University. 
on methodological considerations—for example, the lack of any means
of dependably sorting people into classes in the same way that they can 
be sorted into societies. (See Hoffman and Lippett, 1960, and Miller 
and Swanson, 1960, in addition to Brown and Cavan for discussions of 
the measurement problem involved in social class categorization.) It is 
somewhat reassuring to learn that 19 indices of socioeconomic class 
membership which have been used by various investigators are highly 
correlated, enough to justify speaking of a single dimension (Kahl and 
Davis, 1955). However, these indices are not identical and the magnitude 
of the relationship between social class membership and particular atti- 
tudes and behavior depends upon the choice of social class index. Regard- 
less of how accurately or consistently a position along a social class 
dimension is measured, there remains the question of whether this dimension
should be divided into discrete classes. Certainly there are cultural
differences associated with status within every United States community 
and attention is called to them by some of the indices used for social 
class—occupation, for example, and the specific occupational distinction 
of white-collar versus blue-collar. But to think that cultural variation is 
found only among discrete groups, but not among levels on a continuously 
varying dimension, is itself an over-simplification which results from the 
origin of the culture concept in the ethnography of discrete societies. 


A flexible conception of socioeconomic status which allows it to be 
treated as either a continuous dimension or a set of categories holds the 
greatest promise of advancing human understanding. Distinctive values 
may be expected from each treatment. Regarding status as a continuous 
dimension facilitates relating it to a number of important dimensions. 
Breaking it into categories, however, seems especially valuable for calling 
attention to distinctive implications of variations in social status which 
are somewhat independent of the main dimension. An example of this 
treatment of socioeconomic status is Miller and Swanson’s account, de- 
scribed later in this chapter, of entrepreneurial versus bureaucratic in- 
tegration settings. 

Resemblance between intra-societal and inter-societal studies is also 
found in the problems of interpretation which they engender. Both are, 
by their nature, correlational rather than experimental and share in the 
difficulties of non-experimental research. Barry (in press) pointed out 
that two main types of interpretation have been employed to explain the 
cross-cultural findings, the sociogenic and the psychogenic. This is also 
true of the social class findings. Sociogenic explanations have viewed the 
adult personality, modal to a class, as being behavior necessary to success- 
ful performance of the role of class member. In this context child-training 
practices, if considered at all, are viewed simply as one expression of that 
modal personality. Within psychogenic explanations the assertion is made 
that child-rearing practices produce modal personality characteristics which
then have a constraining influence on adult behavior. (See Gold, 1958, 
for an interesting example in which the sociogenic and psychogenic inter- 
pretations are pitted against each other in an effort to explain social class 
differences in aggression.) The studies encountered in support of this 
thesis are quite varied. All too rare are studies in which class was related 
to child-rearing practices which in turn were related to later behavior 
of the individuals who had actually been subjected to these practices. 
(See Hoffman, 1966; Miller and Swanson, 1960; and Sears et al., 1957 
for examples of this approach.) 


Intra-Societal Variation in Behavior 


This review is limited to those studies providing evidence of social 
class differences in general attitudes and behavior and to the theoretically
important studies dealing with social class differences in child rearing.

As one would expect, social class has a number of economic and socio- 
logical correlates. For example: 1) The middle-class family tends to be 
more stable than the lower-class family and to be nuclear rather than 
extended; 2) security of husband’s employment varies with social class, 
as does the likelihood that the wife will not need to be employed; 3) in 
high-status groups, husbands were found to make more decisions than
wives; in the middle-status (roughly middle-class) group, a high degree
of equality in decision-making of husband and wife was found; and in 
the low-status group, the wife was found to be more dominant than in 
either the high- or middle-status groups (Blood and Wolfe, 1960; Olsen, 
1960). The finding of greater marital satisfaction, psychologically mea- 
sured, in middle-class than in lower-class women is probably related to 
these characteristics (Blood and Wolfe, 1960). 

Considerable evidence exists of class differences on such broad dimen- 
sions of behavior as quality of family relationships, patterns of affection 
and authority, parents’ conceptions of parenthood and their expectations 
for the child, children’s perception of parents, general expressive styles 
and modal reactions to stress. (See comprehensive reviews of this litera- 
ture by Clausen and Williams, 1963; Elder, 1968; and Miller and Swan- 
son, 1960). 

In a widely noted study of social class and parental values, Kohn 
(1959) found that parents of all social classes thought their children 
should be honest, happy, considerate, obedient, and dependable. However, 
middle-class parents were found to emphasize such internalized standards 
of conduct as honesty and self-control, consideration, and curiosity, while 
working-class parents emphasized qualities that assure respectability, such 
as obedience, neatness, and cleanliness. 
Duvall (1946) found that working-class mothers elicited specific
behavioral conformities from their children, whereas middle-class mothers 
focused on the child’s growth, development, affection and satisfaction. 
Middle-class parents were seen as having more acceptant, egalitarian
relationships with their children and as being more accessible to the child than
the parents in the working class (Maas, 1951). Although the working-class
father was reported to be less available and accessible to the child than 
the middle-class father (Bronfenbrenner, 1961), working-class mothers 
were seen as expecting their husbands to be more directive and to play 
a larger role in the imposition of constraints (Kohn and Carroll, 1960). 
Boys from the middle-class were found to perceive their parents as more 
competent, emotionally secure, accepting, and interested in their child’s 
performance than did lower-class boys, with these class differences being 
greater for the perception of the father than of the mother (Rosen, 1964). 

Findings of this sort are not confined to children’s relations with 
their parents. Milner (1951) found that lower-class children were more
likely than middle-class children to perceive adults in general as pre- 
dominantly hostile. In studying retarded children drawn from the lowest 
segment of the lower socioeconomic class, Shallenberger and Zigler (1961) 
found that these children were characterized by an atypically high degree 
of wariness of adults and inferred that this wariness was due to social 
class experiences rather than to retardation per se. 

Social class differences in children’s general approach to problems or 
“styles” of life have also been discovered. Alper, Blane, and Abrams (1955) 
hypothesized that middle-class children would be more fearful of getting 
dirty while engaged in a finger-painting test than lower-class children. 
This hypothesis was generated from the view of Davis and Havighurst 
(1946) that middle-class as compared to lower-class children are subjected
to earlier and more consistent influences which cause them to be
“orderly, conscientious, responsible, and tame” and from Ericson’s (1947)
conclusion that middle-class children are “more anxious as a result of 
these pressures.” As predicted, the middle-class children showed a lower 
tolerance for getting dirty, for staying dirty, and for the products they 
produced while dirty. Somewhat related to this is the finding that among 
children and adults of the middle as compared to the lower class one en- 
counters a greater readiness to experience guilt (Miller and Swanson, 
1960; Zigler and Phillips, 1960). As Clausen and Williams (1963) pointed 
out, studies of this type (Davis, 1954; Green, 1946) view the working- 
class child as “better adjusted”—free of the excessive guilt, repressed 
hostility, and driving anxiety of his middle-class counterpart. 

Contrary to this view, several studies measuring aspects of personality
which seem relevant found “better adjustment” in middle-class children 
(Clausen and Williams, 1963; Burchinal, Gardner and Hawkes, 1958; 
Sewell and Haller, 1956). (The findings of Sears et al., 1957, that middle-class
mothers are more permissive in their child rearing than lower-class
mothers would generate a prediction opposite to that of Alper et al., 1955.)
Miller and Swanson (1960) presented evidence that as a result of different 
child-rearing practices, middle-class children more readily employ repression 
as a defense, whereas denial is more characteristic of the lower-class child. 

Somewhat related to this conceptual-motoric dichotomy are the class 
differences that were found in the expression of aggression. Adults of the 
lower class were often found to vent their hostility in overt acts of aggres- 
sion against others, while those of higher status were more likely to turn 
their hostility inward, expressing it in self-deprecatory attitudes and 
suicide (Gold, 1958; Henry and Short, 1954; Zigler and Phillips, 1960). 
McKee and Leader (1955) found lower-class as compared to middle-class 
children to be both more competitive (defined by acts aimed at excelling 
or asserting superiority) and aggressive (defined by acts intended to injure 
another child). Davis (1943) also found aggression to be more apparent 
among the lower socioeconomic group. Even in fantasy, the lower-class 
child appeared to be more aggressive than his middle-class counterpart 
(Miller and Swanson, 1956, 1960). However, findings on class and ag- 
gression were not completely consistent. Maas (1954), for example, did 
not find that lower-class adolescent boys were consistently more aggressive 
than middle-class boys, and Body (1955) found more aggressive behavior 
in a middle-class than in a lower-class nursery school. 

Social class differences in achievement, independence, and conformity 
are also evident in the research. Rosen (1956) found that for the middle-class
as compared to the working-class child there was more emphasis on
independence in early childhood, higher expectations associated with school 
performance, a greater belief in the possibility of success, and a greater 
willingness to pursue those activities that make achievement possible. A 
greater degree of internalization of achievement striving among middle- 
class as compared with working-class high school students was found in 
two related experiments by Hoffman, Mitsos and Protz (1958). Thompson
(1959) inferred that there is more conformity in middle- than in lower- 
class adolescents; his conclusion was based on Mussen and Kagan’s (1958) 
finding that conformity is positively related to punitive and restrictive 
child rearing and Psathas’ (1957) evidence that this type of child rearing 
characterizes the middle rather than the lower class. 

In view of the importance of intellectual factors in determining the
ultimate level of social and personal adjustment, students of socialization
should be especially interested in the repeated finding that middle-class
children average higher than lower-class children on most general tests of
intelligence as well as classroom indices of school achievement. This
finding raises a particularly thorny issue about relationships between class



REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 1 


and behavior. It is probably safe to assume that both the individual's 
class position and his intellectual level are important determinants of his 
general behavior. But since these two are substantially correlated, there 
is usually no way of knowing how much of the apparent dependence of 
any variable upon one of them should more properly be ascribed to the 
other. When data are gathered to provide this information, things are 
sometimes seen in a new light. Miller and Swanson (1960), for instance, 
gave such information about the relationship between class and several 
other variables, showing that the relationship is sometimes markedly al- 
tered when intelligence is controlled. 
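
The logic of controlling for intelligence can be illustrated with a first-order partial correlation. The sketch below is offered only as an illustration of that logic: the function name is an assumption of this example, and the three coefficients are hypothetical values chosen for arithmetic convenience, not figures reported by Miller and Swanson (1960) or by any other study cited here.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z statistically held constant.

    Uses the standard first-order formula:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical coefficients (illustration only, not taken from any cited study)
r_class_behavior = 0.30   # social class with some behavioral measure
r_class_iq = 0.40         # social class with measured intelligence
r_behavior_iq = 0.50      # behavioral measure with measured intelligence

controlled = partial_correlation(r_class_behavior, r_class_iq, r_behavior_iq)
print(round(controlled, 3))  # prints 0.126
```

With these assumed values, an apparent class-behavior correlation of .30 falls to about .13 once intelligence is held constant, which is the kind of marked alteration the paragraph above describes.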

Thus far most of the research on class and intellectual functioning 
was conducted within the framework of a commitment to the environ- 
mentalist position, and one can distinguish two versions of environment- 
alism associated with different types of research. One position assumes 
that the average level of intellectual functioning probably does not differ 
from one class to another and that the observed relation is an artifact 
of measurement, a product of the unfairness of intelligence tests for lower-class
children (Davis, 1954; Eells et al., 1951; Haggard, 1954; Isaacs,


Where “culture-fair” tests were constructed and applied, however, 
performance on them was found to be significantly related to social class 
membership. McArthur and Elley (1963) found that the culture-fair 
intelligence measure correlated with social status about +.22 to +.24 
compared with +.33 to +.34 for traditional intelligence tests. 

A quite different environmental position is that there are real class 
differences in intellectual functioning and that these are produced by 
class differences in environment. Environmental differences vary from 
very general and sociogenic to the specific and psychogenic or cognitive; 
for example, broad class attitudes towards intelligence and education (e.g.,
Toby, 1963), general child-rearing practices which favor one cognitive
style rather than another (e.g., Witkin et al., 1962), specific types of
class-related interpersonal communications which result in specific deficits 
in intellectual functioning (e.g., Bernstein, 1961; Hess and Shipman, 1965). 
Studies associated with this last and most specific example are especially 
promising and appear to fulfill Jones’s plea (1954) that investigators move 
from the assertion that the environment influences general intellectual 
development to the investigation of how particular events impinging on 
the child influence particular cognitive processes. (See Deutsch, Katz, and 
Jensen, 1968, for more complete reviews of types of studies noted immedi- 
ately above.) 


Intra-Societal Differences in Child Rearing 
Differences in behavior associated with social class membership are
well-documented. There is little agreement, however, on exactly why such
differences exist. As noted above, these social class differences are often 
explained as resulting from child-rearing practices of the different social 
classes. This explanation generates the expectation that clear differences 
among the classes in child-rearing practices would be empirically demon- 
strable. Although some reviewers (e.g., Cavan, 1964) were able to abstract 
from a number of studies certain general differences in child rearing 
associated with social class membership, the student of socialization is 
doomed to disappointment if he expects to encounter a great deal of 
clarity on the relationship between social class and child-rearing practices. 
The contradictory and inconsistent nature of the findings in this area was 
emphasized by Clausen and Williams (1963). These reviewers argued that 
much of this inconsistency was due to the focus on specific infant and 
child care practices, often taken out of context, and that greater agreement 
is to be found when attention is shifted from the more specific, and per- 
haps more fleeting, to certain more general and enduring dimensions such 
as quality of family relationships and patterns of affection and authority. 
Even on these latter dimensions, however, agreement is not as great as 
one would expect. For instance, Green (1946) took a rather broad-gauged 
approach toward middle-class values, goals and child-rearing practices and 
concluded that they are such as should produce an anxiety-ridden, if not 
imminently neurotic, child. But this conclusion does not necessarily fit 
the facts; Sewell and Haller (1956, 1959), employing an equally broad- 
gauged approach, concluded that lower-class children are more anxious 
than middle-class children, although for reasons other than those which 
Green advanced to explain the anxieties of the middle-class child. 


An early and well-known study of social class and child training was
conducted in Chicago by Davis and Havighurst (1946). Examining prac- 
tices associated with feeding and weaning, toilet training, aggression con- 
trol, household chores, and techniques of discipline, Davis and Havighurst 
found that lower-class as compared to middle-class children were: breast-fed
more frequently, more often fed on demand, weaned later, started on
toilet training later, and expected to begin helping in the home at a later
age. Middle-class as compared to lower-class children were less severely
punished for soiling after toilet training had begun and were more frequently
permitted to “fight each other so long as they do not hurt each
other badly.” Middle-class mothers were found to mention reward or praise
more frequently as a means for getting children to obey. A general conclusion
was that the child-rearing practices of the middle class were
oriented around restraint and self-discipline, whereas those of the lower
class were more permissive. 

This conclusion was challenged in a study conducted nine years later
in the Boston area (Maccoby and Gibbs, 1954; Sears, Maccoby and Levin,
1957). No differences in feeding and weaning practices were found between
the two social classes. Middle-class parents completed their children’s
bowel training later than lower-class parents, were less severe in
their toilet training procedures, and were more permissive of their chil- 
dren’s aggression when it was directed toward other children or toward 
themselves. Among disciplinary techniques, scolding statements suggesting 
withdrawal of love were more frequent in the middle class, while physical 
punishment and deprivation of privileges were more common in the lower 
class. Middle-class parents were found to be more permissive of the 
child’s sexual behavior, and the relationship between father and child 
was found to be warmer in the middle- than in the lower-class home. The 
authors concluded that middle-class parents were generally more permis- 
sive, gentler, and warmer toward their children than were working-class 
parents. 

In attempting to explain the inconsistency between their Boston study 
and the Chicago study, Sears et al. (1957) asserted that the Chicago data 
could also be interpreted as showing greater permissiveness on the part of 
the middle- as compared to the lower-class mother if the behavioral con- 
sequences of each particular child-rearing practice were fully considered. 
However, Havighurst and Davis (1955) concluded that the disagreements 
between the two studies were substantial and important, and may have 
been due to inadequacies in the sampling procedures in both studies as 
well as to changes in child-rearing ideology between 1943 and 1952. 

In a third study, Littman, Moore, and Pierce-Jones (1957) examined 
the child-rearing practices of middle- and lower-class parents in Eugene, 
Oregon. Like the Boston study, but inconsistent with the Chicago findings, 
the Eugene data indicated no class differences associated with feeding 
and weaning practices. Also consistent with the Boston study were the
findings that father-child relations were better in the middle than in the
lower class, and that the middle-class parents were more permissive of
the child’s sexual behavior. However, in other child-rearing practices, the 
Eugene study supported neither the Chicago nor the Boston study, indicat- 
ing instead a much greater similarity in child-rearing practices in the two 
social classes. For example, in the Eugene study, no significant class 
differences were found in toilet training, aggression control, or techniques 
of discipline. 


In a thoughtful discussion of the findings of the Chicago, Boston, and 
Eugene studies taken in toto, Littman, Moore, and Pierce-Jones (1957) 
pointed out that a relatively small percentage of findings in any of the 
studies were statistically significant and that, of the significant findings,
many were inconsistent from one study to another; they concluded that 
there were probably no general or profound differences among classes in 
socialization practices. 
In another effort to resolve the discrepancy between the Chicago and
Boston studies, White (1957) compared the child-rearing practices of 
lower- and middle-class mothers living in the suburban area south of 
San Francisco. She found that middle-class as compared with lower-class 
mothers were more permissive in a number of areas of training, and con- 
cluded that her study showed more agreement with the Boston study 
than with the Chicago study. White suggested that the discrepancies be- 
tween her findings and the Chicago study were due to changes in child- 
rearing practices that had occurred in the time lapse between the two 
studies. 

However, White showed no vast class differences in child-rearing 
practices and the extent to which her study supported the Boston study 
is doubtful. Of the 17 variables on which White compared the findings 
of the Chicago, Boston, and San Francisco studies, there were 14 on which 
no statistically significant differences between the classes were found in 
the San Francisco study. Of the three variables with significant differences, 
one was in agreement with the Chicago study and another was in agreement
with both the Chicago and Boston studies. It thus appears that the
bulk of the agreement between the San Francisco and Boston studies con- 
sists of finding that the social classes did not differ on a sizable number 
of child-rearing practices. Rather than supporting the conclusions of 
either the Chicago or Boston studies, then, the San Francisco study lends 
further credence to the Littman et al. conclusion that class differences in 
child rearing are smaller than would be expected from either the Chicago 
or Boston study. 

Another important investigation of child-rearing differences between 
middle and lower classes was the large-scale study conducted in Detroit by 
Miller and Swanson (1958). In this, as well as in a later investigation 
(1960), Miller and Swanson advanced the argument that a variety of 
changes in our society (including those in immigration patterns, ratio of 
urban to rural dwellers, and the general nature and complexity of our 
economic institutions) has changed the meaning of social class membership.
As a result of such changes, social class membership no longer
implies any underlying set of values, attitudes, goals and life styles. 
Homogeneity, they argue, is found instead in what they call an “integra- 
tion setting” which cuts across social class lines. Thus, to Miller and 
Swanson, child-rearing practices are not directed as much toward incul- 
cating values and behavior germane to the social class as toward developing 
a personality consonant with success in the family’s particular integration 
setting.

Miller and Swanson conceptualized two types of integration setting: 
the entrepreneurial and the bureaucratic. Membership in the entrepre- 
neurial setting is characterized by involvement in an economic organization 
with the following features: small size, a simple division of labor, a
relatively small capitalization, and provision for income mobility through
risk-taking and competition. The social situations encountered in such a
setting are referred to by Miller and Swanson as “individuated” since
they tend to isolate people from one another and from the controlling
influence of shared cultural norms. According to Miller and Swanson
(1958, p. 57), “Children reared in individuated and entrepreneurial homes
will be encouraged to be highly rational, to exercise great self-control, to
be self-reliant and to assume an active, manipulative stance toward their
environment.”

Bureaucratic families are considered to be typically involved in an
economic setting characterized by substantially capitalized large organizations
employing many kinds of specialists. For the bureaucratic family,
income is in the form of wages or salary and mobility comes through
specialized training rather than through success in taking risks. Such
families are viewed as being involved in a welfare bureaucracy in which
the organization provides support in meeting their personal crises and
offers continuity of employment and income despite fluctuations in the
business cycle. According to Miller and Swanson (1958, p. 58), “Children
reared in welfare-bureaucratic homes will be encouraged to be
accommodative, to allow their impulses some spontaneous expression, and to seek
direction from the organizational programs in which they participate.”

Miller and Swanson examined child-rearing practices as a function of
social class and integration setting. A surprisingly small number of
differences in child rearing were found to be associated with either. Miller
and Swanson looked at differences between groups defined by both class
and integration setting, e.g., entrepreneurial middle vs. bureaucratic lower,
and related these differences to the findings of the Chicago and Boston
studies. Although their findings were generally quite disparate from those
of the Chicago study, they concluded that comparisons between the
entrepreneurial middle class and the lower class of either integration setting
tended to resemble the Chicago findings; comparisons between the
bureaucratic middle and entrepreneurial lower class showed some resemblance
to the Boston findings. In both instances, however, a great deal of the
resemblance pertained to variables on which the finding was of no difference.

Miller and Swanson, then, do present some limited evidence indicating
that integration setting influences child-rearing practices and that
considering the integration setting may reduce somewhat the disagreement
found among studies of child-rearing practices and social class. The
evidence is not, however, sufficient to justify taking serious issue with the
negative conclusion of Littman et al. (1957) about important general
differences among classes in socialization practices.


Some investigators of social class differences concentrated on broad dimensions
of child rearing (e.g., restrictive vs. permissive) rather than on
specific infant and child care practices. Klatskin, Jackson, and Wilkin (1956) 
found some interesting though generally not statistically significant trends 
in child-rearing styles associated with social class membership. Upper- 
middle-class mothers showed more optimal child-rearing practices (neither 
too rigid nor overpermissive) related to feeding, sleeping, toileting, etc., 
than did either lower-middle-class or upper-lower-class mothers. Lower- 
middle-class mothers were the most likely to have rigid practices. What 
most characterized upper-lower-class mothers was that they showed no 
consistent pattern, but varied in optimal, rigid, or overpermissive behavior 
from one aspect of training to another. 

It appears that even on broad dimensions of child rearing, findings 
about social class and child-rearing relationships are inconsistent. Indeed, 
the very meaningfulness of such broad dimensions as “permissiveness” was 
questioned by Kohn (1959a), and the conceptual difficulties which inhere 
in abstracting such broad dimensions from particular child-rearing prac- 
tices were cogently discussed by Littman et al. (1957). 

Inconsistencies among studies have sometimes been attributed to the 
general inadequacy of the survey technique upon which they depend. The 
parent interview has very uncertain validity as an indicator of actual 
child-rearing practices. Although there is evidence that the interview tech- 
nique may sometimes provide accurate information concerning child- 
rearing practices (e.g., Klatskin, 1952), a growing body of evidence on the 
social desirability factor in subjective reports (Christie and Lindauer, 1963; 
Edwards, 1957; Marlowe and Crowne, 1961; Taylor, 1961) suggests that 
some of the supposed class differences in child rearing and some of the 
inconsistencies across studies may actually relate to variations in the par- 
ents’ sensitivity to what constitutes a socially desirable statement about 
child rearing. 

Inconsistencies have also been attributed to the fact that various 
studies are based on data collected in different years. Variations in find- 
ings do seem likely to partially reflect real changes in practice occurring 
differently at different class levels. That the advice experts give to parents 
on how to raise their children has changed over the years was documented 
by Stendler (1950) and by Wolfenstein (1953). Bronfenbrenner (1958) 
reanalyzed some of the studies of social class and child-rearing practices 
described herein and demonstrated, particularly for the middle class, a 
high degree of correspondence between child-rearing practices reported 
and expert advice prevailing at the time. Thus, Bronfenbrenner managed 
to reduce the inconsistency among studies. 

In view of the significance of the date the study was conducted and 
the possible contaminating effect of the interview technique, a recent 
investigation by Waters and Crandall (1964) on social class and maternal
behavior is of special importance. Employing home visit data collected at
the Fels Institute on children between three and five years old, they ex- 
amined the relationship between nine types of observed maternal behavior 
and social class membership at three periods: 1939-1941, 1948-1952, and 
1959-1961. No significant relationships were found between social class and 
nurturant maternal behavior, defined by the variables of babying and 
protectiveness, at any of the three times. Social class was also found to 
bear little relation to affectionate maternal behavior, defined by the vari- 
ables of affectionateness and direction of criticism (approval); the only 
significant relationship found with these two variables was that in the 
1940’s, maternal approval was positively correlated with social status. 
Maternal coercion, defined by the variables of coerciveness of suggestions 
and severity of penalties, was found to be somewhat more associated with
socioeconomic class. In the 1960’s sample, both variables were found to
be negatively correlated with social status; coercion was higher in the lower
class. The maternal behavior variable most consistently related to socio- 
economic class was found to be restrictiveness of regulations; during all 
three time periods, the lower the family status, the more a mother tended 
to impose restrictive regulations on her child’s behavior. The variables of 
clarity of policy and accelerational efforts were found to be positively re- 
lated to social class in the 1940’s but not in the 1950’s or 1960’s. Altogether, 
of the 27 correlations (nine variables at three time periods), nine were 
statistically significant, and in no instance did a significant result at one
period reverse a significant result of another period. Waters and Crandall 
noted that the nature of their sample differs from those of earlier investiga- 
tions and pointed out that their results tend to disagree with those of 
studies employing the interview technique, but to agree with those of other 
studies employing direct observation.


Waters and Crandall reported some consistent changes in maternal 
behavior over time. Regardless of social class, mothers became progres- 
sively less coercive between 1940 and 1960. Nurturant and affectionate 
behavior exhibited a curvilinear trend between 1940 and 1960; babying, 
protectiveness, affection, and approval peaked in 1950, at the height of
the “permissive era,” were lower in 1940, and were lowest in 1960. Con-


the Klatskin et al. (1956) finding of greater permissiveness of mothers 
regardless of social class between approximately 1940 and 1950, though 
Klatskin et al. dealt with the first year of the child’s life instead of the 


” 

The discovery that there are trends in child-rearing practices that
cut across social class membership does little to illuminate the central
issue of class differences. To this central issue, Waters and Crandall made 
the important suggestion that there may be a very different understanding
of class differences when there is better knowledge based on direct observa-
tion. For the present, I would emphasize along with Littman et al. (1957) 
that even in those instances in which a statistically significant relation- 
ship between social class and child-rearing practices was found, the mean 
difference between populations was so small, compared with the great 
overlap in the distributions and the large spread of each distribution, 
that the discovered differences were often relatively trivial in predictive 
and explanatory power. 
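
The relation between a small mean difference and a large overlap can be made concrete with a short calculation. The sketch below is an illustration only and is not drawn from any of the studies reviewed: it assumes two normally distributed populations of equal size and equal spread, and the standardized mean differences it loops over are hypothetical values chosen for the example.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def overlap_coefficient(d: float) -> float:
    """Shared area of two equal-variance normal curves whose means
    differ by d standard deviations (assumed model, see lead-in)."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)


def variance_explained(d: float) -> float:
    """Squared point-biserial correlation implied by a standardized
    difference d between two groups of equal size."""
    r = d / math.sqrt(d * d + 4.0)
    return r * r


if __name__ == "__main__":
    # Hypothetical standardized differences, not values from the literature.
    for d in (0.1, 0.2, 0.5):
        print(f"d = {d:.1f}  overlap = {overlap_coefficient(d):.3f}  "
              f"variance explained = {variance_explained(d):.4f}")
```

Under these assumptions a difference can be large enough to reach statistical significance in a big sample while the two distributions still share most of their area, which is the sense in which such a difference has little predictive power.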


Other Interpretations of Intra-Societal Differences 


However socialization is defined, it should be regarded as a life-long 
process. The importance of socialization phenomena at periods of the life 
cycle other than childhood has been stressed by a number of workers who 
differ significantly in other respects. Erikson (1959), while continuing to 
regard the early years as especially important, viewed the individual’s 
behavior as the outcome of a series of conflicts or crises which occur 
throughout life and argued for the need for equally explicit attention to 
all periods. Social learning theorists (cf. Bandura and Walters, 1959) 
emphasized the importance of “models” whose behavior is imitated. This 
approach suggests that in adulthood the behavior of models in the indi- 
vidual’s milieu will be of prime importance. Instrumental learning theorists 
(cf. Bijou and Baer, 1961) stressed reinforcement contingencies as the 
ultimate determinant of the individual’s social behavior. Within this 
framework, paramount importance would be given to the individual’s 
relatively recent history of rewards and punishments accompanying the 
particular social behavior of interest. Finally, more sociological thinkers 
(cf. Brim and Wheeler, 1966) underlined the continuing importance of 
socialization through adulthood, noting that the individual never ceases 
to adopt new social roles, and that most of the pertinent socialization 
occurs around the time of adoption of these roles rather than decades 
earlier. 

Thus, in opposition to relying exclusively on a child-rearing explana- 
tion, social class differences in behavior leave open a broad spectrum of pos- 
sible interpretations ranging from the social-ecological to the genetic. The 
former extreme illustrates the tendency to seek psychological mediators for 
relationships first observed at a social level. Aberle (1961); Barry, Child 
and Bacon (1959); and Miller and Swanson (1960) presented social- 
ecological accounts of class differences and suggested psychological inter- 
pretations, though, except for Miller and Swanson, the empirical context 
was largely inter-societal variation rather than intra-societal. Like Whiting, 
these investigators sought to escape the culture-socialization circularity by 
viewing the maintenance system as the prime mover, and, in what might 
be called an implicit Marxist approach, they stressed the influence of an
economy on socialization and personality. A society requires individuals
capable of performing the necessary economic functions and will tend to 
select socialization practices that will favor values and behavior contribut- 
ing to that capability. 

This point of view is seen most clearly in the work of Miller and 
Swanson (1958, 1960), whose concept of the integration setting represents 
a particularly interesting effort to reduce the class concept to a psycho- 
logically more meaningful level. To Miller and Swanson, the American 
economic system has so changed that members of a single class may differ 
widely in the pressures their economic function exerts toward personality 
type. Whether a family is engaged in entrepreneurial or bureaucratic 
activities determines the socialization practices they engage in. Thus, there 
is an effort here to mediate class differences in behavior by calling into 
play the economically oriented concept of integration settings and by 
viewing socialization as directed towards producing individuals who have 
social-psychological characteristics in keeping with their integration settings. 

At the opposite end of the ecological-individualistic continuum is the 
genetic interpretation. In its immature stage, psychology tended to seek 
single causes of behavior, and a solely genetic influence was then easy to 
dismiss. The continuing neglect of genetic considerations now that psy- 
chology has matured is probably related to the American ethos. The 
egalitarian tradition of the United States has unquestionably contributed 
to the absence of research on possible genetic influence on social class 
differences and to the near-absence even of the discussion that might lead 
to it. 

Gottesman (1965, 1968), noting the important work of Burt (1959, 
1961), recently published some valuable papers which help fill this gap. 
He pointed out that the social class differences are those between popula- 
tions rather than between individuals and that whenever there is a sizable 
degree of reproductive isolation between populations, the relative frequen- 
cies with which the different forms of genes occur in their gene pools 
differ. Basing his views on the clear fact of assortative mating within social 
classes and the evidence of definite genetic influence on some aspects of 
personality (see Vandenberg, 1965), Gottesman argued that some social 
class differences in behavior may rest partially on a genetic basis rather 
than on the wholly environmental basis often supposed. Although his 
hypothesis is speculative, as Gottesman pointed out, it may well merit 
more attention than it has received to date. 

Somewhere between the social-economic-ecological interpretation of 
social class differences in behavior and the genetic interpretation is the
developmental viewpoint advanced by Zigler and his co-workers (cf. Katz 
and Zigler, 1967; Kohlberg and Zigler, 1967; Phillips and Zigler, 1961, 
1964; Zigler, 1963). Building loosely upon the theoretical approaches of
Piaget (1950, 1953, 1955, 1962) and Werner (1948), these investigators
suggested that behavioral differences between the lower and middle classes 
are due to the differing developmental characteristics of individuals within 
the two classes. The argument here is that the developmental progression
of individuals in the lower class is on the average slower and more limited 
than that of individuals in the middle class, and that differences in be- 
havior between the classes from childhood onward arise because com- 
parisons are being made between groups of individuals who are of different 
average developmental levels. The developmental approach purposely has 
remained ambiguous about the causes of differences in the rate of develop- 
ment and in the upper levels achieved. At the present time, these differ- 
ences can be attributed to genetic factors, differences in environmental 
inputs, or, perhaps most reasonably, to some interaction between these 
two sets of factors. 

In keeping with Piaget’s thinking, proponents of the developmental 
approach to social class differences emphasized the formal cognitive char- 
acteristics of the individual as a crucial mediating structure in the person’s 
intercourse with his environment. If social classes differ greatly in the 
distribution of formal cognitive structure or developmental level of their 
members, one would expect to discover social class differences in behavior. 
Although differences in the rate of cognitive development associated with 
class membership are now well-documented, their role in producing class 
differences in behavior has been largely ignored. However, American 
psychologists have tended to broaden the purely cognitive definition of 
developmental levels to include a wide array of social competence indices, 
styles, values, and orientations indicative of personal and social maturity 
(cf. Phillips and Zigler, 1964; Zigler and Phillips, 1960, 1962). Within 
the developmental framework, it is not an individual’s prestige, nor the 
general culture of his socioeconomic class, nor the various roles he occupies 
in society that are emphasized as direct determinants of behavior, but 
rather his internal psychological structure. The extreme sociogenic view 
assumes that the individual’s cognitive structure is entirely a product of 
class membership; the extreme developmental version postulates that the 
cognitive structure a person attains will be the sole determinant of his 
future class membership—that is, the sole determinant of what culture 
he will be comfortable with or will join in creating. The truth obviously 
lies somewhere between these two theoretical extremes, and presumably 
neither extreme has any adherents. 

In the less extreme and more tenable form in which it is actually 
advanced, the developmental approach seems to be a legitimate attempt to 
understand some of the effects of the sociological variable of social class 
membership in terms of the psychological variable of personal develop- 
mental level. On an empirical level, it should be noted that each concept,
i.e., social class and developmental level, can be separately and reliably
defined. In instances where social class largely determines developmental 
level or vice versa, measures of the two would be highly correlated. The 
two would, no doubt, always retain sufficient independence to permit the 
determination of how much of the variance in any other variable can be 
attributed to one and how much to the other. It is known that relation- 
ships of a certain magnitude have been found between social class and 
particular behavioral variables. If the magnitude of these relationships
is enhanced by substituting developmental level for social class, then the 
developmental interpretation of social class differences in behavior takes 
on added credence. If the magnitude of the relationships is reduced, then 
the developmental argument is weakened. In tests of this sort, a uniform 
outcome is not to be expected for all variables. If such a program were 
carried out, one would probably discover that developmental level medi- 
ated some relationships between class and behavior but did not mediate 
others. The explanation for relationships not mediated by developmental 
level would then become the domain of a variety of the other theoretical 
approaches alluded to above. 
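
The substitution test described above can be sketched in a few lines. The example is a minimal illustration under stated assumptions, not a procedure taken from the studies cited: the eight sets of scores are invented, the function names are hypothetical, and partial correlation is used here as one common way of asking how much of a relationship remains once the other variable is held constant.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y with z held constant."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Invented scores for eight hypothetical individuals (illustration only).
social_class = [1, 1, 2, 2, 3, 3, 4, 4]
developmental_level = [1, 2, 2, 3, 2, 4, 3, 4]
behavior = [2, 5, 4, 6, 5, 8, 6, 9]

if __name__ == "__main__":
    print("class with behavior:        ", round(pearson_r(social_class, behavior), 3))
    print("development with behavior:  ", round(pearson_r(developmental_level, behavior), 3))
    print("class, development constant:",
          round(partial_r(social_class, behavior, developmental_level), 3))
    print("development, class constant:",
          round(partial_r(developmental_level, behavior, social_class), 3))
```

In these invented data the relationship with behavior is stronger for developmental level than for social class, and the class relationship shrinks sharply once developmental level is held constant. That is the pattern the developmental interpretation would predict. The opposite pattern would weaken it.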

At a theoretical level the developmental approach has the advantage 
of allowing the utilization of a somewhat untidy but nevertheless broad 
body of research on developmental processes. This research places a 
number of restraints on developmentalists’ efforts to explain social class 
differences and, of more importance, dictates the particular relationship 
that should be found between social class and certain behavior. If, for 
example, members of the lower class are on the average characterized by 
a lower developmental level than members of the middle class, then the 
two classes should be distinguished on a variety of specific variables asso- 
ciated with developmental level. Several of the class differences in behavior 
noted earlier conform to this expectation. For instance, the greater guilt, 
self-derogation, and intropunitiveness up to and including suicide (Henry 
and Short, 1954; Miller and Swanson, 1960; Zigler and Phillips, 1960) 
found in individuals of the middle as compared to the lower class are 
predictable from developmental theorizing. As Phillips and Rabinovitch 
(1958) pointed out, such “turning against the self” implies an introjection 
of social standards which is more characteristic of higher than of lower 
levels of development. Evidence that an increasing capacity to experience 
guilt accompanies increasing cognitive growth and development was 
presented by Katz and Zigler (1967). 

A particularly striking instance in which a social class-behavior rela-
tionship is consistent with developmental thought is Kohn’s finding (1959a,
1959b) that working-class parents tend to respond to their children’s
transgressions in terms of the immediate consequences of the child’s 
actions, whereas middle-class parents tend to respond in terms of the
child’s intent in acting as he does. As Kohn pointed out, this distinction
is quite in keeping with the developmental distinction made in Piaget’s 
(1962) discussion of moral realism. 


Also consistent with the developmental interpretation of social class 
differences in behavior are the general findings that lower-class persons 
are somewhat more ready to resort to physical punishment, are more 
physicalistic in their choice of occupations, and engage in more acting-out 
up to and including homicide, whereas middle-class persons tend to be 
more obsessive and ideational (Henry and Short, 1954; Miller and Swanson, 
1960; Phillips and Zigler, 1964; Zigler and Phillips, 1960). This contrast 
in life style corresponds closely to an important dimension in development, 
namely the action-thought dimension. Developmental theorists of both 
psychoanalytic (Freud, 1952; Hartmann, 1952; Kris, 1950; Rapaport, 1951) 
and nonpsychoanalytic persuasion (Lewin, 1963; Piaget, 1951; Werner,
1948) suggested that primitive, developmentally early behavior is marked 
by immediate, direct and unmodulated response to external stimuli and 
internal need. In contrast, higher levels of maturation are characterized 
by the appearance of indirect, ideational, conceptual, and symbolic or 
verbal response. The developmental action-thought dimension offers a 
clear alternative to the sociogenic interpretation which would view the 
greater acting-out of lower-class individuals as a direct product of their 
conformity to lower-class culture. According to developmental interpre- 
tation, both the individual's acting-out and the lower-class culture which 
encourages it would be viewed as reflecting the developmental character- 
istics of class members. 


Similar disagreement between the external-sociogenic emphasis and 
the internal-developmental emphasis arises in considering class differences 
in the incentive value of being correct—a motivational characteristic espe-
cially significant in the socialization process. Evidence has
now been presented either indicating or suggesting that middle-class chil-
dren are more motivated to be correct for the sheer sake of correctness 
than are lower-class children (Cameron and Storm, 1965; Davis, 1964; 
Douvan, 1956; Ericson, 1947; Terrell, Durkin and Wiesley, 1959; Zigler 
and deLabry, 1962; Zigler and Kanzer, 1962). Zigler and Kanzer, for 
instance, found that the verbal reinforcers most effective with lower-class 
seven-year-olds were those indicating personal praise (“good” and “fine”), 
while the verbal reinforcers most effective with the middle class were those 
indicating their behavior was correct (“right” and “correct”). Two quite 
different interpretations can be applied to this finding. A somewhat
sociogenic interpretation would be that “being right” is a value that is held
in higher regard in the middle than the lower socioeconomic class, and
therefore for the middle-class seven-year-olds, as compared to the lower-
class seven-year-olds, “being right” has been more frequently associated
with secondary and primary reinforcers. 

An alternative explanation would employ the concept of a develop- 
mentally changing hierarchy of reinforcers. As was suggested by Beller 
(1955), Gewirtz (1954), and Heathers (1955), the effectiveness of atten- 
tion and praise as reinforcers diminishes with maturity, while the rein- 
forcement inherent in the information that one is correct progressively 
increases in effectiveness. This shift is away from reinforcement by others 
and toward reinforcement by self and appears to be central to the child’s 
progress from dependence to independence. 

Though the child’s social experience obviously remains relevant, this 
explanation does not attribute special importance to the type of reward 
customary in the child’s environment; it stresses instead the child’s cogni- 
tive ability—specifically, his ability to comprehend a verbal stimulus as 
a cue for self-reinforcement and to be able to administer this type of rein- 
forcement. This ability requires that the child differentiate himself from 
others and comprehend that his success is a direct outgrowth of his own
efforts; it also involves the maturity required by the rather complicated 
process of taking the self as an object that can either be rewarded (and 
hence feel proud) or punished (and hence feel ashamed or guilty). 

Such a process is a far cry from that earlier period in the child’s 
life when the efficacy of a social reinforcer is probably dependent upon 
its close relationship to primary reinforcers, and a wide array of social 
stimuli influences behavior in a relatively undifferentiated hedonistic way 
involving little or no central mediation. At an earlier age, the child might 
respond to the spoken word “good” as a reinforcer in some such direct 
way without the involvement of complex processes which at a later age 
might make “good” and a variety of other words and gestures equivalent 
because of their common implications. 

At a later age, reinforcers which consist of praise (words such as 
“good” and “fine”) would be conceptualized as conveying information to 
the child on how the speaker feels toward the responses the child has 
made. When the child is able to feel that powerful adults are pleased 
with him, he may anticipate further reward from them. At an even later 
developmental level, however, the child becomes more liberated from con- 
cern with the feelings of social agents, and the task of obtaining primary 
reinforcers from them normally becomes less urgent. He becomes a more 
autonomous agent primarily interested in obtaining mastery over his world. 
The motive of effectiveness becomes central, and he becomes interested in 
the quality of his own performance. Here his concern is not limited to 
how social agents feel about him but is extended to how he feels about 
himself. How he feels about himself, moreover, is determined by the 
success he encounters in dealing with the continuous problems presented
by the environment. What now interests him is whether he is doing
things correctly, whether he is right. Thus, social agents and the social 
reinforcers they dispense take on new meaning. At this stage the social 
reinforcer signifying successful coping by the child is the one he values 
most; the feelings of the social agent, though related, recede in importance. 

When this reasoning is applied to the finding that seven-year-old 
middle-class children are more motivated to receive reinforcers indicating 
correctness than are seven-year-old lower-class children, it suggests that 
the latter are developmentally lower than the former in not having made 
a transition in which reinforcers signifying correctness replace reinforcers 
signifying praise in the reinforcer hierarchy. Related to this argument is 
the work of several investigators (Davis, 1941, 1943; Terrell et al., 1959; 
Zigler and deLabry, 1962) indicating that lower-class children are less 
influenced than middle-class children by abstract, symbolic rewards. This 
would obviously be expected if the lower-class child were indeed develop- 
mentally younger than the middle-class child of the same chronological 
age. Some recent studies (McGrade, 1966; Rosenhan and Greenwald, 
1965) failed to support the reinforcer-hierarchy interpretation of social 
class differences in preference of particular classes of verbal reinforcers. 
Yet, so many findings are consistent with this interpretation, and with 
the more general developmental approach of which it is a part, that 
more investigation of their implications and validity is clearly called
for.

None of the positions which I have examined—the specific child- 
training practices, the social-ecological-economic, the genetic, the develop- 
mental—appears capable of solely explaining all of the behavioral correlates 
of social class membership. The positions probably differ in the contribu- 
tion each can make in isolation, and this depends on the general state 
of knowledge at the time. It may be concluded that isolated emphasis on 
child-training practices taken out of context is probably of limited value, 
and that the lesson it can teach us is, if anything, too well learned today.
Given the limited work to date, it may also be concluded that the develop- 
mental approach offers today some rather novel understanding even when 
considered in isolation. But one can be confident that, with real inter-
locking of the various explanations, a still better understanding will be
attained.

6: DESEGREGATION AND MINORITY

The 1954 Supreme Court desegregation decision was
based on fine legal and moral argument, but on rather slim social science 
evidence (Deutscher and Chein, 1948; Clark, 1953). In view of the clear 
challenge to the caste system that the decision represented and the extent 
of the changes called for, it is singular that so few empirical tests of 
the dictum that segregation per se is harmful to children were undertaken 
in the decade that followed. Then came the Equality of Educational Op- 
portunity Survey (Coleman, 1966). The magnitude of this study and its 
unexpected findings suggest that research on the education of minority 
children should be labeled “before Coleman” or “after Coleman.” The 
insights afforded by the survey and its methodological limitations propel 
social scientists to more definitive research. To date no longitudinal studies 
with adequate samples and controls have been published, but reports on 
an increasing number of small-scale bussing experiments are becoming 
available. Pieced together, such bits of evidence help define the shape 
of the larger puzzle and identify what is known and what is not known. 

This chapter reviews Pre-Coleman, Coleman, and Post-Coleman em-
pirical evidence on the relation of school racial composition to the academic
performance of black children. The independent variable is called “racial 
composition” advisedly; neither scholars nor schoolmen agree on the defi- 
nitions of racial balance, desegregation, non-segregation, or integration. 
Some use these words interchangeably, while others distinguish between 
them according to whether or not the school ethnic mix matches that of 
its community, whether or not a uni-racial school has become bi-racial, 
whether or not the process was planned, and whether or not the minority 
group is accepted into the social life of the school. I am interested in 
school racial mixture of any type, although I recognize that surrounding 
circumstances may condition its relationship to academic performance. 

Many possible outcomes of schooling (creativity, curiosity, civic re-
sponsibility, moral judgment, artistic taste, human sensitivity) are im-
portant, but they typically go unmeasured in social research. Therefore,
although the subject of this chapter could be the relation of school racial
composition to children’s total intellectual, emotional and moral develop-
ment, for the most part I review research in which achievement test
scores were the criterion variable.

*Dr. David Cohen, Harvard University, served as a consultant to Dr. St. John on the
preparation of this chapter. Support for the preparation of this manuscript was supplied
in part by U.S.O.E. contract #6-10-240 with Yeshiva University.


Choice of Research Design 


The experimental model is a convenient way of organizing the evi- 
dence I am interested in. To establish a causal relationship between 
classroom ethnic composition and academic performance, a researcher 
should employ the classic pre-test, post-test model with subjects randomly 
selected and assigned to experimental and control groups. Both groups 
are tested before the experimental group is subjected to the test condition 
(desegregation), all other conditions must remain the same for both groups, 
and later both groups are tested again. A greater change between Time 1
and Time 2 for the experimental than for the control group can presum- 
ably be attributed to the effect of desegregation (Stouffer, 1962; Campbell 
and Stanley, 1963; Pettigrew and Pajonas, 1964). 
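
The logic of the classic design can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of any study reviewed here: the pupil identifiers and test scores are invented, the function names are hypothetical, and a simple difference in mean gains stands in for whatever statistical test an investigator would actually apply.

```python
import random
from statistics import fmean


def random_assignment(pupil_ids, seed=0):
    """Randomly split pupils into experimental and control halves."""
    rng = random.Random(seed)
    shuffled = list(pupil_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean_gain(pre, post):
    """Average change from Time 1 to Time 2 for one group."""
    return fmean([after - before for before, after in zip(pre, post)])


def gain_difference(exp_pre, exp_post, ctl_pre, ctl_post):
    """Experimental gain minus control gain; under random assignment a
    positive value is the change attributable to the test condition."""
    return mean_gain(exp_pre, exp_post) - mean_gain(ctl_pre, ctl_post)


if __name__ == "__main__":
    experimental, control = random_assignment(range(1, 9), seed=1)
    print("experimental pupils:", experimental)
    print("control pupils:     ", control)
    # Invented achievement scores at Time 1 and Time 2.
    print("difference in gains:",
          gain_difference([50, 60, 55, 47], [58, 66, 60, 54],
                          [52, 58, 49, 61], [55, 61, 53, 63]))
```

The control group's gain estimates the change that would have occurred without desegregation, so only the excess gain in the experimental group is credited to it. When random assignment is not possible, the two groups may differ at the outset, and the same subtraction no longer isolates the effect.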

Such experimental research in the area of desegregation is very diffi- 
cult to achieve. The random assignment to control and experimental 
conditions usually seems politically and morally questionable, and the 
loss of cases through migration or school leaving is apt to jeopardize the 
randomness of the sample. Therefore one finds frequent use of a longi- 
tudinal design without a control group, of a cross-sectional design without 
a Time 1 measurement, or of quasi-experiments with statistical rather
than actual control of other variables, even though any correlation between
key variables cannot be taken as evidence of a causal relation between them.


Two rival independent variables, school quality and family back- 
ground, are especially likely to contaminate research on the effect of 
ethnic segregation on the performance of children. In pre-Coleman Report 
days probably few people doubted that school quality varied with racial 
composition or that predominantly black schools were by and large inferior 
to predominantly white ones in physical plant, equipment, curricular offer-
ings and teacher qualifications. Comparisons of Negro and white schools
in the South (McCauley and Ball, 1969; Miller, 1960) or ghetto and 
non-ghetto schools in the North (Public Education Association, 1955; 
Conant, 1961; Sexton, 1961; Wolff, 1963; Katzman, 1966) certainly sup- 
ported this conclusion. There have also been repeated reports that teachers 
in ghetto schools have low morale, a low opinion of their pupils, and an 
eagerness to transfer to more middle class (or white) settings (Becker, 
1952; Gottlieb, 1964; Clark, 1965; Herriott and St. John, 1966). Coleman 
(1966) found many differences within regions between schools attended 
by majority and minority children, to the advantage of the former, though 
most of these were surprisingly slight. Critics of the Coleman Report 
argued that various methodological limitations of the study mask the true 
relation between race and school quality (Nichols, 1966; Bowles and 
Levin, 1968; Dyer, 1968). Thus, any superiority in the performance of 
integrated over segregated children could in part be due to a difference in 
school quality. 

The issue of school equality is raised in a different form by the 
recent introduction of compensatory programs into most Northern city 
school systems. To the extent that such programs tend to remove former 
inequalities and to equalize education across schools, they act as a control 
in studies comparing performance in segregated and integrated settings. 
To the extent that they go beyond equalization and offer extra services 
to minority group children (newer buildings, smaller classes, greater per 
pupil expenditure, better prepared teachers), they in theory make it more 
rather than less difficult to test the effect of ethnic composition per se 
(Gordon and Wilkerson, 1966; U. S. Commission on Civil Rights, 1967). 

The second variable most likely to contaminate research on the effect 
of school desegregation on pupil performance is the social and economic 
level of home or neighborhood. Most researchers find that socioeconomic 
status (SES) predicts achievement for Negroes, though less well than for 
whites (for example, McGurk, 1953; Klineberg, 1963; Kennedy, Van de 
Riet and White, 1963; Pettigrew, 1964; Stodolsky and Lesser, 1967). In an 
attempt to isolate other measures of a child’s home environment that 
would predict his school achievement better than the usual indices of 
SES, Peterson and Debord (1966) interviewed 11-year-old Negroes in a 
Southern city; they found a set of 11 home variables that had a multiple 
correlation of .82 with achievement scores. In a similar endeavor, Whiteman, 
Brown, and Deutsch (1966) developed a Deprivation Index (measuring 
housing dilapidation, number of siblings, kindergarten attendance, 
educational aspiration of parent for the child, dinner conversation, and 


U.S. Commission on Civil Rights (1967) reports corroborated such 
evidence on the relationship between aspects of home background and 
the verbal achievement of minority group children. If school quality and 
the family background are positively related to the achievement of minority 
pupils and to their schools’ racial composition, it is crucial to control them 
in any study of the influence of ethnic composition. 


Longitudinal One-Group Studies 


The merit of the longitudinal study is that selection bias is partly 
ruled out if the same subjects are tested before (or at the beginning of) 
a period of non-segregated schooling and again after some months or years 
in the desegregated situation. The weakness of such a design is that with- 
out a control group, there is no assurance that any observed effect is 
not due to the influence of previous testing, to normal maturation, to 
extraneous events, or especially to a change in the quality of schooling. 


Desegregation of School Systems 


Two studies often referred to as evidence of the beneficial effect of 
the desegregation of a school system are Hansen’s report on Washington, 
D.C. (1960) and Stallings’s report on Louisville, Kentucky (1959). Han- 
sen reported that in the five years following consolidation of the separate 
school systems in the District of Columbia, median city-wide achievement 
improved at all grade levels and in most subject areas. Unfortunately, for 
a number of reasons this finding is not evidence that desegregation was 
causally related to improved minority group performance. 1) No testing 
of black children was done before desegregation, and no separation of 
black and white scores was made after desegregation. 2) The scores of 
the same children were not traced through the years; instead successive 
third grade (etc.) classes were compared. 3) Actual racial composition of 
schools and classrooms was not considered, and the simultaneous establish- 
ment of the track system probably resulted in considerable classroom seg- 
regation in those schools that were technically desegregated. 4) With 
desegregation came major improvements in the quality of education— 
lowered teacher-pupil ratios, increased budget, more remedial services; 
these furnish plausible alternative explanations of the improved perform- 
ance. 

Schools in Louisville were desegregated by court order in 1956. Stal- 
lings (1959) reported that the academic achievement of Negro and white 
students was significantly higher after than before desegregation and that 
the Negro students made greater gains than the white students did. But 
again there is evidence that most schools remained segregated. The gains 
of black pupils were greatest when they remained with black teachers, 
i.e., in all-black schools (United States Commission on Civil Rights, 1962, 
p. 34). Stallings suggested that one factor may have been increased moti- 
vation resulting from legal desegregation. 


Desegregation of Individuals 


In studies of the effect of Northern residence on the intelligence test 
scores of Negro children, the investigators regularly found that migrants 
from the South score higher in proportion to their length of residence in 
the North (Klineberg, 1953; Lee, 1951; Moriber, 1961). Unfortunately 
in these migration studies, the racial mix of the Northern schools was 
not reported and the effects of three variables are confounded: community 
with school desegregation and school desegregation with school quality. 
Researchers can more successfully test the unique effect of school desegrega- 
tion when individuals enter a mixed school, especially if those individuals 
do not change residence to a mixed neighborhood. 


A large longitudinal study of the impact of desegregation on pupil atti- 
tudes and performances is in process in Riverside, California. In the fall of 
1965 the Riverside Board of Education decided to close three buildings and 
bus their Negro and Mexican-American pupils to other buildings in the 
system (Purl, 1967, 1969). Since de facto segregation was thereby elimi- 
nated, no control group will be available. To date the only progress reports 
indicate that two years of integration have had “not much effect” on 
achievement, but since neither social class nor other variables were con- 
trolled, even this conclusion seems unwarranted. 


Katzenmeyer (1963) studied all pupils who entered the public kindergartens 
of 16 Jackson, Michigan schools in the years 1957 and 1958. The 
192 Negro children, who thus represented the full socio-economic range 
in the city, were distributed among 11 integrated schools; one school was 
66% black and the rest were below 40% black. Children were given the 
Lorge-Thorndike IQ test at the beginning of kindergarten and second 
grade. The two-year gain for white children was 1.87 points, for Negro 
children 6.68—a difference significant at the .001 level. It is regrettable 
for research purposes that no matched group of black children spent those 
years in segregated Jackson schools. 

A recently completed evaluation of Project ABC, a scholarship program 
which brings disadvantaged high-school students to independent boarding 
schools, followed an entering class of 82 boys (10% Negro, 10% American 
Indian, 9% Puerto Rican, 2% Oriental, and 9% white) who entered 39 
different schools in the Fall of 1965 (Wessman, 1969). Faculty reported 
scholastic gains for half the group, but test-retest showed no significant 
change on mean Otis IQ or Cooperative English Achievement tests. 

Clark and Plotkin (1963) surveyed Negro students at integrated colleges. 
The sample consisted of the 509 students who returned questionnaires 
out of the 1519 who received aid or counseling from the National 
Scholarship and Service Fund, 1952-1956. College grades were found to be 
higher than could have been predicted on the basis of pre-college scholastic 
aptitude test scores. The net dropout rate of 10% was one-fourth the 


study would have been strengthened by more random sampling, by post- 
tests and by the testing of a control group of similar students at segregated 
colleges, since it seems probable that many students who applied to the 
NSSFNS were especially able, motivated and likely to succeed wherever 
they enrolled. Even so, there would be no way of knowing whether any 
differences found were due to the integration or to the quality of the college 
experience. These criticisms, especially of the lack of control on other 
variables, apply not only to the Clark and Plotkin study but to all one- 
group “before and after” studies that have been reviewed here. 


Cross-Sectional Studies 


The major weakness of cross-sectional studies, in which segregated 
and integrated subjects are compared without either group having been 
tested before the introduction of desegregation, is that there is no guarantee 
that the two groups were originally equivalent. Systematic differences are 
found by researchers who compare the characteristics of families living 
in integrated and segregated neighborhoods (Duncan and Duncan, 1957; 
Stetler, 1957; Hughes and Watts, 1964) or of families who do and do not 
volunteer for bussing experiments (Crockett, 1957; Luchterhand and Weller, 
1965; Weinstein and Geisel, 1962). It is therefore quite likely that the 
integrated subjects are a selected group, in terms of social class and/or 
ability. 


Pre-Coleman Studies 


As early as 1930, Crowley studied Stanford Achievement Test scores 
of Negro fourth to sixth grade pupils in two segregated and four mixed 
schools in Cincinnati. Of 110 children equated on grade, age, and Stanford 
Binet IQ, half had experienced only segregated and half only integrated 
schools. The two groups were said to be comparable also on “Physical 
condition, family history and social status”. No statistically significant 
differences were found on 11 of 13 achievement tests. Spelling and writing 
were the exceptions; differences in these skills favored mixed schools 
(Crowley, 1932). 

Radin (1966) studied the black pupils of two neighboring elementary 
schools (45% and 100% Negro) in Ypsilanti, Michigan. The schools were 
said by Radin to be alike in financial support and curriculum and in the 
IQ and SES of pupils, but the original equivalence of pupils in the two 
schools was not satisfactorily demonstrated. No statistically significant 
differences were found in mean scores on IQ or Iowa Test of Basic Skills, 
though the direction of the difference was uniformly in favor of the inte- 
grated school. When the performance of very high or very low achievers 
was examined, however, the differences were in favor of the segregated 
school. 


In a similar study of naturally segregated and non-segregated elementary 
school children in New York, Jessup (1967) compared a) Negro and 
Puerto Rican second and fifth graders in a traditional, middle-class school 
(75% white) with b) students in a comparable, low SES, project school 
(96% Negro and Puerto Rican) and with c) students in a new, Higher 
Horizons school (93% Negro). Since social class (measured by residential 
census tract data) was found to be so highly related to achievement, sub- 
samples of 18 integrated and 80 segregated low SES children were compared 
on IQ, math and reading. This comparison revealed a distinct disadvantage 
for the segregated children, even for those in the Higher Horizons school 
with superior facilities and remedial services. The lowest SES children in 
the integrated school showed higher achievement than middle SES children 
in the segregated school. This study is handicapped by small sample size, 
inadequate control on individual SES, and lack of any “before” measure- 
ment. 

In four recent dissertations, the investigators used roughly the same 
design. Meketon (1966) gave a battery of tests to Negro fifth and sixth 
graders in three schools in Kentucky: school A, de facto segregated; school 
B, peacefully integrated; and school C, integrated under “anxiety arousing 
circumstances”. In schools A and B samples of children were matched on 
age, grade, sex, Otis Quick Scoring IQ and SES. In school C all children 
were included; though not matched with the other children they were 
reported to be generally similar in background but of somewhat higher IQ 
(on the California Test of Mental Maturity). Contrary to prediction chil- 
dren in school C had significantly higher scores than pupils in school A on 
the Digit Span Backward and Verbal Meaning Tests, and higher than 
school B pupils on these and a Space Ability Test. Unfortunately, the 
difference in IQ could explain the superior performance of school C 
pupils. 

In a New York State community, Lockwood (1966) compared 217 sixth 
grade black students attending balanced and unbalanced (over 50% Negro) 
schools for two years or longer. Contemporaneous scores on Iowa Tests of 
Basic Skills and the California Test of Mental Maturity indicated that 
the students in the balanced schools were significantly higher at all IQ 
levels. When IQ was not controlled or when students had been less than 
two years in balanced schools, the differences were in the same direction 
but not statistically significant. The absence of control on individual SES 
and the evidence presented of higher mean SES in the balanced schools 
render the findings inconclusive. Samuels (1958) matched Indiana students 
on IQ and SES and found that at the first and second grade levels, Negroes 
in a segregated school had higher achievement scores, but in the third 
through sixth grades the achievement was higher in a racially mixed school. 

The most statistically sophisticated study in this group is Matzen’s 
(1965) correlational analysis of the fifth and seventh grade achievement 
scores of 1,065 Negro and white children in 39 segregated and integrated 
classrooms in one California community. Zero order correlations indicated 
that per cent of Negroes in the classroom was significantly and negatively 
related to Negro achievement, especially at the higher grade level (pre- 
sumably because there was grouping by ability for seventh graders but 
not for fifth graders). However, when IQ and SES were controlled, second- 
order partial correlation showed the relationship to be no longer statistically 
significant. 
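
The second-order partial correlation used in such an analysis can be built up from zero-order correlations by applying the standard first-order formula recursively. The following is a minimal sketch; the correlation values are invented for illustration and are not Matzen's data, and the variable labels (pct, ach, iq, ses) and function names are hypothetical.

```python
from math import sqrt


def first_order(r_xy, r_xz, r_yz):
    """Partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def second_order(r, x, y, z, w):
    """Partial correlation of x and y with both z and w held constant.

    r maps frozenset({a, b}) to the zero-order correlation of a and b.
    """
    def c(a, b):
        return r[frozenset((a, b))]

    r_xy_z = first_order(c(x, y), c(x, z), c(y, z))
    r_xw_z = first_order(c(x, w), c(x, z), c(w, z))
    r_yw_z = first_order(c(y, w), c(y, z), c(w, z))
    # The same formula, applied to correlations already adjusted for z.
    return first_order(r_xy_z, r_xw_z, r_yw_z)


# Hypothetical zero-order correlations, invented for illustration only.
r = {
    frozenset(("pct", "ach")): -0.30,  # classroom per cent Negro vs achievement
    frozenset(("pct", "iq")): -0.35,
    frozenset(("pct", "ses")): -0.40,
    frozenset(("ach", "iq")): 0.70,
    frozenset(("ach", "ses")): 0.50,
    frozenset(("iq", "ses")): 0.45,
}

zero_order = r[frozenset(("pct", "ach"))]
partial = second_order(r, "pct", "ach", "iq", "ses")
print(zero_order, round(partial, 3))
```

With these invented values the partial correlation is much closer to zero than the zero-order correlation, which is the pattern described above. Whether that shrinkage reflects a genuine absence of effect depends on when the control variables were measured.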

Thus, of six recent cross-sectional comparisons of black children’s per- 
formance in segregated and integrated elementary schools in six different 
communities and in five states, all found that without controls, achievement 
was higher in integrated schools. Several of the studies then controlled on 
IQ, but since they used a score more or less contemporaneous with that of 
achievement, they naturally found that the differences between segregated 
and desegregated children tended to disappear. IQ as well as achievement 
is malleable and the two tend to co-vary. A more valid procedure would 
be to test ability at the outset of pupils’ careers in segregated or integrated 
settings. But none of these investigators did so. Moreover the controls on 
social class were rough. The suspicion remains, therefore, that self-selection 
may have biased their findings. 


There have been a number of quasi-longitudinal cross-sectional studies 
in which investigators compared the secondary or college performance of 
students from segregated and integrated earlier schools. Wolff (1962) 
examined the high school records of black students in Plainfield, New 
Jersey; he found fewer dropouts, higher reading achievement, higher rank 
in the graduating class and higher enrollment in further education for 
the 20 graduates of an all-Negro school. No SES data and no significance 
tests on their differences were reported. Vane (1966) compared the high- 
school records of 52 black children from predominantly white schools in 
a large suburban community with those of 19 black children from an 89% 
Negro school. The average IQ was 100 for students from both types of 
school. She equated 17 pairs on IQ and SES and found no significant 
differences in achievement at any level. St. John (1962, 1964) reported 
that with SES controlled there was a non-significant trend toward higher 
high school test scores for those New Haven blacks who had attended more 
integrated elementary schools. 


Johnson, Wyer and Gilbert (1967) tabulated first term grades of 121 
Negro freshmen at one University of Illinois campus according to the 
racial composition of their Chicago high schools. A curvilinear relation 
was found; students from schools less than 50% or more than 90% black 
did better than those from schools 50-89% black. Further examinations 
indicated that students from one group of schools more than 90% black 
attained considerably higher grades than those from another group of 
schools of the same racial composition. The authors speculated that individual 
or school social class probably explained the difference, but they 
gave no data with which to test their assumption. 

St. John and Smith (1969) analyzed the achievement of two-thirds of 
the black ninth graders in the city of Pittsburgh in 1966 (1388 pupils). 
When the effects of individual and neighborhood SES and sex had been 
removed through regression analysis, arithmetic achievement was signifi- 
cantly and negatively related to the average racial composition of a pupil’s 
school in grades 1-9. 

Thus, the balance of evidence of pre-Coleman cross-sectional studies 
is that, at the very least, integration had little negative effect on minority 
group performance and that it apparently had a positive effect, though it 
is hard to be sure, since other variables could account for the observed 
trends. (See Weinberg, 1968, for reports on other cross-sectional studies.) 


Coleman and Commission Reports 


In the Equality of Educational Opportunity Survey (EEOS), the investigators 
administered a series of achievement tests and questionnaires 
to more than 600,000 students in some 4,000 elementary and secondary 
schools (Coleman, 1966). Verbal ability scores showed more variation 
than other test scores and were selected as the chief measure of academic 
achievement. 

Only a small part (10-20%) of the variance in achievement was found 
to be between rather than within schools (see Table 3.221.2 in Coleman 
et al., 1966). For Negroes, up to 30% of the between-school variance 
was accounted for by differences in family background (see Table 2.221.2). 
With the effect of family background removed, school characteristics ac- 
counted for little of the remaining achievement variance, but of these, 
characteristics of fellow students accounted for more than other school 
attributes (see Table 3.23.1). For this review the most important finding 
of the report concerns the effect of racial segregation: when students’ own 
background, characteristics of the school, and characteristics of the student 
body were controlled, school per cent white accounted for almost no verbal 
ability variance for Negroes (see Table 3.23.4). 
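
The distinction between variance lying between schools and variance lying within them can be illustrated with a small sketch. It is not the EEOS regression procedure; it simply computes the share of the total sum of squares that is attributable to differences among school means. The scores and the function name between_school_share are hypothetical.

```python
from statistics import mean


def between_school_share(schools):
    """Fraction of total score variation lying between school means.

    schools maps a school label to a list of pupil scores.
    """
    all_scores = [s for scores in schools.values() for s in scores]
    grand_mean = mean(all_scores)
    total_ss = sum((s - grand_mean) ** 2 for s in all_scores)
    between_ss = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in schools.values()
    )
    return between_ss / total_ss


# Hypothetical verbal scores for two schools, invented for illustration only.
schools = {
    "A": [40, 50, 60],
    "B": [45, 55, 65],
}

share = between_school_share(schools)
print(round(share, 3))
```

When pupils differ far more from one another within a school than school means differ from each other, as in this invented example, only a small part of the variation is left for any school-level characteristic to explain.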

The U.S. Commission on Civil Rights (1967) reanalyzed the Cole- 
man data in tabular form; they concentrated on twelfth grade Negro 
students in the metropolitan Northeast and on ninth grade Negro students 
in eight regions. The analysis showed that classroom racial composition 
(in the previous year) made a difference in verbal achievement, beyond 
the social class of either the pupil or his fellow students or of teacher 
quality. (See Appendix Tables 4.1 to 8.12.) Moreover, the earlier the 
grades at which Negroes reported first having had white schoolmates, the 
higher their achievement. There are three probable reasons why classroom 
per cent white “last year” should be more related to achievement than 
school per cent “this year”: 1) At the ninth grade many pupils move 
from segregated elementary schools to desegregated secondary schools. 
“Last year’s” experience may be a proxy for eight years’ experience and, 
therefore, more influential than a few weeks’ experience “this year.” 
(EEOS tests were administered in September.); 2) Within desegregated 
schools, considerable segregation often results from the practice of assignment 
to classrooms on the basis of test scores: i.e., ability influences classroom 
per cent white. (It may also influence school per cent white but less 
strongly.); 3) Classroom per cent white influences ability and has a 
stronger effect than school per cent white. The authors of the Commission 
Report stressed this last point, but either of the other two may also be 
operative. 

Further analyses of the EEOS data (McPartland, 1968) supported the 
Commission’s conclusion on the relation between classroom composition 
and achievement. McPartland showed that school desegregation was associated 
with higher achievement for black pupils only if they were in predominantly 
white classrooms, but classroom desegregation was favorable 
irrespective of school per cent white. He claimed that the classroom racial 
composition effect was not entirely explained by selection into track or 
curriculum. 


The evidence that the Equality of Educational Opportunity Survey 
made available on the extent of ethnic segregation and academic retarda- 
tion for minority group children is invaluable. The evidence of the relation 
between segregation and retardation is not completely convincing, since 
it is subject to a number of methodological criticisms: 1) The representativeness 
of the sample was compromised by the non-cooperation of a 
number of large cities. 2) The measures of social class are unconvincing: 
There is no measure of family income. The item measuring parental occupation 
proved uncodable. Eighteen per cent of the ninth graders left 
blank or did not know their parents’ educational level. Other researchers 
have found that children tend to upgrade their parents’ education or 
occupation on a precoded questionnaire (Colfax, 1967; St. John, 1969). 
Self-reports on items in the home may also be biased and in any case, 
the quantity not the quality of these items was told. The reliability study 
conducted by Coleman’s staff in two school districts in Tennessee found 
64% to 100% agreement between children and their teachers depending 
on grade level and nature of item. It should be noted, however, that 
teachers’ and children’s responses were not necessarily independent and 
that blanks and I don’t know’s were counted as agreement, a questionable 
procedure. In short, it is difficult to have confidence that the effect of 
social class was entirely removed. 3) Part of the apparent effect of the 
background of fellow students may be due to unmeasured variation in a 
pupil’s own background. Smith’s (1969) reanalysis of the EEOS data 
indicated that two errors made in the original analyses led to underestimation 
of the effect of home background and an exaggeration of the effect 
of student characteristics. 4) The percentage of white schoolmates in the 
current year or of white classmates in the previous year may be less 
important than the per cent of white schoolmates or classmates over a 
number of years. Particularly at the ninth grade level present school racial 
experience is a poor estimate of past school racial experience. (See St. John 
and Smith, 1969.) 5) A cross-sectional analysis with no estimates of orig- 
inal ability or of the original equivalence of segregated and nonsegregated 
students can not conclusively demonstrate a causal relation between segre- 
gation and achievement or attitude. 
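
The concern raised under point 2, that counting blanks and "don't know" responses as agreement inflates the reliability figure, can be made concrete with a small sketch. The responses below are invented for illustration and are not EEOS data; the function name agreement_rates and the coding of blank answers are assumptions.

```python
def agreement_rates(child, teacher, blanks=("", "don't know")):
    """Return (rate counting blanks as agreement, rate excluding blanks).

    child and teacher are equal-length lists of coded responses to one item.
    The second rate is undefined (division by zero) if every pair has a blank.
    """
    pairs = list(zip(child, teacher))
    is_blank = [c in blanks or t in blanks for c, t in pairs]
    # Procedure criticized in the text: a blank on either side counts as agreement.
    agree_incl = sum(b or c == t for (c, t), b in zip(pairs, is_blank))
    # Alternative: drop pairs with a blank and score only answered items.
    answered = [(c, t) for (c, t), b in zip(pairs, is_blank) if not b]
    agree_excl = sum(c == t for c, t in answered)
    return agree_incl / len(pairs), agree_excl / len(answered)


# Hypothetical reports of parental education, invented for illustration only.
child = ["hs", "college", "", "hs", "don't know", "grade"]
teacher = ["hs", "hs", "hs", "hs", "college", "hs"]

incl, excl = agreement_rates(child, teacher)
print(round(incl, 2), round(excl, 2))
```

In this invented case the agreement rate is noticeably higher when blanks are scored as agreement than when they are set aside, which is why the reported figures may overstate the reliability of the background items.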


Experimental or Quasi-Experimental (Four-Celled) Studies 


A few studies have employed an experimental or quasi-experimental 
model, in that they provide Time 1 and Time 2 measurement of segre- 
gated and desegregated children matched on key variables. Two dissertation 
studies in the border South are four-celled in this sense. Fortenberry 
(1959) studied the mean achievement gains made in three subject areas 
during eighth and ninth grades of black pupils of similar IQ who attended 
mixed and nonmixed classes in Oklahoma City. Those in mixed classes 
made greater gains in arithmetic and language, but less gain in reading 
than did segregated pupils. No controls on social class were reported. 
Anderson (1966) found that 75 black fourth, fifth and sixth graders in 
five Tennessee desegregated (8 to 33% Negro) schools achieved signifi- 
cantly higher scores on Metropolitan Achievement Tests than did 75 black 
students from three segregated schools. The segregated and desegregated 
schools were judged “equivalent with respect to tangible factors” and their 
pupils were matched on age, intactness of family, third grade IQ scores 
and second grade achievement. The younger the age at which children 
had entered the desegregated schools, the greater was the apparent benefit. 


Bussing Experiments 


With increasing frequency in the last few years, school systems have 
issued reports of bussing experiments in which ghetto children are trans- 
ported to predominantly white schools. In most cases the bussed children 
were tested before the program began and again one or two years thereafter, 
and their gains were compared with those of children who remained 
in segregated schools. The validity of the comparison as a test of the 
effect of racial composition on pupil performance hinges on 1) the 
equivalence of the bussed and non-bussed children at the outset, 2) the 
holding power of the two programs, and 3) the equality of schooling in 
every way except racial composition. Therefore, evaluation of bussing 
studies should note a) whether there is random assignment to experimental 
and control groups, or at least matching on key variables, b) whether there 
is mortality of cases through withdrawal from either program or failure to 
appear for tests, and c) whether children are bussed to another school 
in the same system or to schools in a presumably superior system. 

New York state was the scene of a number of desegregation experi- 
ments and of the first five studies reviewed here. An early study of the 
effect of bussing in New Rochelle does not really belong in this group of 
four-celled studies since it lacks measurement of pre-desegregation achievement 
(Wolman, 1964). Except at the kindergarten level, no statistically 
significant differences were found between those who transferred to inte- 
grated, middle class schools and those who stayed in a de facto segregated 
school, perhaps because as another study indicated (Luchterhand and 
Weller, 1965) more lower class families elected to transfer. It is also 
reported that in the year of the study the segregated school had the benefit 
of extra services (Kaplan, 1962). 

In 1964 the White Plains school board initiated a racial balance plan 
which involved closing one elementary school and bussing about 
black pupils from two sending schools to six receiving schools (White 
Plains Board of Education, 1967). Participation was mandatory. A three- 
year study was made of the IQ and Stanford Reading and Arithmetic Test 
scores of 33 of these black pupils (who entered grade 3 in 1964) in pre- 
dominantly white schools. Since segregated schools no longer existed in 
the city, the 33 were compared with 36 black pupils who in 1960 had 
entered the third grade in the segregated school. The newly desegregated 
children gained slightly more in paragraph meaning and arithmetic than 
the earlier central city children. However, in word meaning the segregated 
children gained more. The findings remain inconclusive due to a number 
of methodological limitations, especially the small number of students 
tested and the lack of a contemporaneous control group. 

Another small program involved bussing 75 pupils from central 
Rochester to a suburb, West Irondequoit (Rock et al., 1966, 1967, 1968). 
Each year from 1965 to 1967, kindergarten teachers selected a pool of 
above-average pupils from which fifty names were drawn and assigned 
randomly, half to the experimental and half to the control group. Parental 
objections resulted in some shifts, and testing mortality was high. In 13 
of 27 comparisons on Metropolitan Achievement Tests over three years, 
there were significant differences between the bussed and non-bussed black 
children; in each instance the difference was in favor of the bussed children. 
It should be noted that the receiving schools were in a high quality school 
system. 

Beker (1967) reported on an experiment in Syracuse in which 60 of 
the 125 Negro elementary children bussed in the year 1964-1965 to a 
predominantly white school (Experimental Group) were compared with 
35 children whose parents requested transfer but for whom places were not 
available (Control Group 1) and with 36 children whose parents refused 
transfer (Control Group 2). After the first year there were no significant 
differences in achievement gain among the three groups. This contradicts 
the U.S. Commission’s (1967) Report for Syracuse of greater gains for 
24 bussed children in comparison with an unspecified number of non- 
bussed children who had the benefit of a compensatory program. The 
latter gain may have been a Hawthorne effect in the first year of the 
program. Nevertheless, the small numbers involved and the lack of any 
control of SES make the Syracuse findings quite inconclusive. 

In Buffalo, New York in 1965, there was mandatory bussing of 560 
Negro pupils from closed and overcrowded schools to predominantly white 
schools in the same city (Dressler, 1967). Of these, 54 in grade 3 were 
tested and compared with 60 in a sending school. The author did not 
specify how these were selected. Comparison on reading showed greater 
gain over the year for bussed students. No controls on SES or other 
variables and no significance tests were reported. A later report from 
Buffalo (Banks and DiPasquale, 1969) described a 1967-1968 study of 
1200 fifth to seventh grade Negro pupils bussed from 6 segregated inner 
city schools to 22 integrated schools. Whole classes were selected for the 
transfer. Though there was no difference in control test scores, the mean 
growth for the year for the integrated pupils was .83 and for the segregated 
pupils .56 (grade equivalence scores). There was, however, no 
assurance that the integrated children or those with available test scores 
did not have more favorable background or higher native ability than the 
segregated. 

There are two sources of information on the results of a bussing ex- 
periment in Philadelphia. The U.S. Commission (1967) reported that 
bussed Negro children of the same social class and reading grade level 
as Negroes in segregated schools with compensatory education (EIP) had 
by the third grade surpassed EIP children and equaled students of slightly 
higher SES in non-EIP schools. Laird and Weeks (1966) may have re- 


and fifth grade levels. When a smaller sub-sample of control and experi- 
mental children were matched on grade, sex and IQ, the bussed children 
made significant gains only on reading. 

The Berkeley, California experiment reported by the U.S. Commis- 
sion on Civil Rights (1967, p. 131) for the 1965-1966 year is corroborated 
by an evaluation of the following year’s bussing project (Jonsson, 1967). 
Two hundred and fifty Negro students transported from segregated low
SES schools in the Flats to integrated middle-class schools in the Hills 
made higher average gains than in previous years and higher gains than 
non-bussed students receiving compensatory education. Though SES data 
were not available on individuals, the students bussed to Hill schools were 
presumably lower class, since they came from a lower-class area of Berkeley
(Sullivan, 1968). However, the Negro children selected for bussing were 
those “who were predicted to adjust well emotionally and academically 
to the new school” and parental consent was required (Jonsson, 1966). 
In other words, the children bussed to the Hills might well have been 
initially superior to their neighbors who remained behind. (See also Wilson, 
1963.) 


Under the auspices of the Metropolitan Council for Educational 
Opportunity (METCO), over 700 Negro students from Boston are being 
bussed at their parents’ request to schools in suburban communities. Dur- 
ing the first year of the program pre- and post-Metropolitan Achievement 
Test scores, available for 66 children in grades 3-8 in three school systems, 
showed significant improvement on reading, word, and spelling tests, but 
not on arithmetic. No control group was possible, since the Boston school 
system did not supply records on its students (Archibald, 1967). In his 
recent evaluation of the same program, Walberg (1969) solved the control 
group problem in an ingenious fashion by using siblings of the bussed 
children matched as closely as possible to them on age. As Walberg pointed 
out, the design does not guarantee the equality of groups, since there may 
be bias in the family’s choice of child to be bussed, but at least the family 
and neighborhood environments of siblings are similar. Further bias was 
introduced because only 47% of the 737 eligible METCO children and 
25% of the 352 eligible siblings were tested in both October 1968 and 
May 1969. Except that METCO children gained significantly less on 
mathematics at grades 5-6, there were no significant differences in achieve- 
ment between the two groups from grades 2-12. Neither sex, nor year in 
program (1-3), nor initial achievement level interacted with bussing status. 

Project Concern, which involves bussing Central Hartford, Connecti- 
cut Negro and Puerto Rican students to several suburbs in the metropolitan 
area, was more carefully designed (Mahan, 1967, 1968). Intact classes 
were randomly selected from eight eligible (85% or more nonwhite) ele- 
mentary schools in the low SES North End. All 300 children in these 
classes with an IQ of 80 or above were bussed, except 12 whose parents 
refused and a random 22 for whom no places were available. A control 
group of 305 children drawn from the same schools proved to be like the 
experimental group in grade distribution (K-5) but to have more girls. 
In the course of the two-year study, 25% of the experimental and 20% 
of the control group were lost through moves from the target area, dropping 
out of the project or missing tests. A unique feature of this project is
that, by selecting whole classes, central city teachers were released to 
accompany the pupils to their new schools and supply extra remedial and 
guidance services. Since not all bussed students received this supportive 
team assistance, it was possible to compare bussed students with and 
without compensatory education in their segregated schools. 
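
The sample sizes implied by this description can be worked out directly. The sketch below is a rough illustration under stated assumptions, not a reconstruction of Mahan's tables: the variable names are hypothetical, and because the text does not say whether the 25% loss applies to the 300 children selected or to those actually bussed, both bases are shown.

```python
# Selection figures as described for Project Concern.
selected_iq_80_plus = 300
parents_refused = 12
no_place_available = 22
bussed = selected_iq_80_plus - parents_refused - no_place_available

# Control group: 20% lost over the two-year study.
control_group = 305
control_retained = control_group * (100 - 20) // 100

# Experimental group: 25% lost; the base is not stated in the source,
# so the retained count is bracketed using both plausible bases.
experimental_retained_range = (bussed * 75 / 100, selected_iq_80_plus * 75 / 100)

print(bussed, control_retained, experimental_retained_range)
```

Under either reading, the experimental group available at the end of the study was noticeably smaller than the retained control group.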

In his two-year evaluation the project director concluded that 1) 
“Youngsters placed in suburban classroom at grades K-3 have a signifi- 
cantly greater tendency to show growth in mental ability (WISC and Test 
of Primary Mental Abilities) than those remaining in inner city class- 
rooms”; 2) In measures of school achievement differences in lower grades 
are significantly in favor of the experimental group, but in the upper 
grades in favor of the control group; 3) Suburban placement with special 
supportive assistance proved more effective than suburban placement with- 
out such assistance; and 4) “There is no evidence that special supportive 
assistance is an effective intervention within inner city schools.” 

The final bussing study reviewed here is the largest—the New York 
Open Enrollment Study (Fox, 1966, 1967; Fox et al., 1968). Since 1960 
some 22,300 pupils have, on their parents’ initiative, transferred from 
predominantly Negro and Puerto Rican “sending” schools to predominant- 
ly white “receiving” schools. Fox (1968, p. 27) concluded: “When children 
who entered O. E. in 1962 were matched in initial reading ability with 
children who remained in the sending school, data from the 1965-66 study 
indicated no difference between them in reading ability. The 1966-67 
study found that unmatched, randomly selected samples of O. E. children 
were reading at higher levels than randomly selected samples of sending 
school children. These findings suggested to the investigator that the O. E. 
children did not reflect the full range of ability in the sending schools 
and that academically more able children entered the O. E. programs.” 

Investigators in five of the nine bussing studies here reviewed found 
greater gains for desegregated children than for segregated children, but 
the case for the beneficial effect of desegregation is marred by several 
methodological shortcomings. The numbers involved were not large, and 
(more serious) in all cases the number tested is considerably smaller than 
the number bussed. This alone would jeopardize the randomness of the 
sample, even if the experimental and control groups were randomly drawn 
from the same pool, but in no case is there assurance on this point. Staff 
selection or parental self-selection always played a part, even in Hartford, 
where assignment was most nearly random. Therefore, it is possible and 
likely that more favorable home background and “achievement press” ex- 
plains the somewhat better performance of bussed pupils. In none of the 
studies was there a careful attempt to evaluate the equality of education
in integrated and segregated classrooms. In the Boston, Rochester, and 
Hartford experiments there was the further complication of bussing out of
a central school district into suburban districts where schools have benefits 
that ampler budgets provide. Therefore, there is no way of comparing the 
effects of the rival independent variables of school quality and school 
ethnic or economic composition. The short duration of most of the pro- 
grams—too short to offset the stimulation or trauma of transfer—is another 
reason for concluding that the over-all effectiveness of desegregation via 
bussing programs has not yet been demonstrated and must await further 
evidence. (For more details on bussing experiments, see Matthai, 1968.) 


The Wilson Study 


Of all the studies on the relation of school ethnic composition to 
minority group performance, the one with the most adequate design was 
Alan Wilson’s (1967) survey, reported in an appendix to the Civil Rights 
Commission Report. The sample was a stratified random sample of more 
than 4,000 junior and senior high school students in the San Francisco 
Bay area. The design is a cross-sectional comparison of verbal test scores, 
according to the racial and social class composition of neighborhoods and 
schools, but longitudinal control is introduced by the data on school racial 
and social class composition at each grade level and first grade individual 
mental maturity test scores. Wilson argued that controlling on these test 
scores equates children on the effects of genetic differences and preschool 
home environment, so changes can be attributed to new (school?) ex- 
periences and not to uncontrolled initial differences. 

Although the sample is large (over 2,400 Negroes), analysis of the 
separate effects of neighborhood and school segregation or of racial and
social class segregation is hampered for Negroes by the confounding of 
these variables, and the fact that few Negroes live in integrated neighbor- 
hoods. Nevertheless, Wilson showed by regression analysis that after con- 
trolling for variation in first grade IQ, the social class of the primary school 
had a significant effect on sixth grade reading level and the social class 
of the intermediate school had a significant effect on eighth grade verbal 
reasoning scores. School racial composition, however, had no significant 
effect on achievement over and above school social class (pp. 180-84). 

Other than the small size of the numbers in some of the cells, there 
are further limitations to this study. First grade scores were presumably 
available only for the most stable members of the sample, and its
representativeness may have been affected by attrition. Children were matched only by
father’s occupation and primary mental maturity scores; in other respects 
segregated and integrated children could have been quite different. Par- 
ental and school social class assignment based on the questionnaire replies 
of students are potentially inaccurate. No evidence is offered as to 
equality of segregated and integrated schools in Richmond. But in spite 
of these quibbles, the study is impressive in design and quite convincing
that in this community, at least, racial integration per se was not signifi- 
cantly related to the academic performance of Negroes. 


Conclusion 


The literature reviewed offered some evidence as to the relation be- 
tween school racial composition and academic achievement, but much 
more evidence as to the difficulties of research in this area. 

The “before and after” studies of the desegregation of school systems 
or individuals suggest that following desegregation, of whatever type or at 
whatever academic level, subjects generally perform no worse, and in most 
instances better. Those studies in which the same individuals were 
measured at Time 1 and Time 2 (Lee, 1951; Katzenmeyer, 1963; Wessman, 
1969; and Clark and Plotkin, 1963) have thus largely ruled out the endur- 
ing characteristics of the subjects and factors in their past (SES, IQ) as 
explanation of the change. But interaction between desegregation and 
quality of schooling has not been ruled out as the explanation of the 
difference. In fact, desegregation in Washington, D. C. reportedly brought 
an upgrading of education and in Louisville gave a psychological boost to 
teachers. Such changes could well explain the gain in achievement in
those cities and in situations involving more classroom desegregation. 

The general finding of the pre-Coleman cross-sectional studies reviewed
is that achievement levels are higher for desegregated than for
segregated pupils. Several of these studies were so small-scale and statis- 
tically limited that researchers would have little confidence in the gen- 
eralizability of their findings, if they were not in agreement with studies 
such as those by Matzen or St. John and Smith which with larger samples 
and better controls also found higher achievement for the desegregated. 
However, for all these studies the unresolved question is: Were the de- 
segregated students a select group to start with? 

The Coleman data are extensive and have been analyzed with statis- 
tical finesse. However, the attempt to draw conclusions from the data is 
handicapped by the cross-sectional design, the unconvincing measures of 
social class, the failure to separate the effects of neighborhood SES and 
of school quality, the imprecise and non-longitudinal measure of school 
ethnic composition. In spite of such limitations, the survey provides fairly 
convincing evidence for the existence of a powerful relation between social 
class integration and achievement. The evidence is less clear for a residual 
relation between racial integration and achievement. The effect appears 
to be small, but could be either exaggerated or masked by inadequate con- 
trol of school quality and home background characteristics. 

In theory, investigators using four-celled studies can avoid most of 
the weaknesses of both panel and cross-sectional research. But no investigation
to date has been able to meet all the canons of pure or quasi-experimentation.
The matching problem plagues all attempts to g
naturally segregated and non-segregated populations. Wilson (1967) 
achieved a post facto “before” measurement by controlling on primary
grade mental maturity, but this procedure did not control on all variables 
and may have masked the effect of racial segregation. 

If, in bussing studies, subjects could be randomly assigned to experi- 
mental and control groups, the matching problem would be avoided; but 
politics and parental preferences seem invariably to bias the selection. 
Further bias is introduced by differential subsequent dropout from ex- 
perimental or control groups as some children leave town, leave the pro- 
gram or are not tested. The small number of children involved in most 
bussing experiments not only handicaps statistical tests of their effectiveness,
but also may add to the Hawthorne effect for those involved. The
stimulation or embarrassment of being a guinea pig or a newcomer is 
probably short-run and can be discounted if the experiment is of long 
enough duration. But the effect of riding a bus to a community other 
than one’s own might be continuing and could only be controlled if 
students were bussed both to segregated and to integrated schools. 

The laboratory experiments of Katz (1964, 1968) and the lessons he 
draws from them are very convincing as to the “threats” and “facilita- 
tions” involved in the process of desegregation. Though as yet unsupported 
by adequate field research, the most plausible hypothesis is that the re- 
lation between integration and achievement is a conditional one: the 
academic performance of minority group children will be higher in inte- 
grated than in equivalent segregated schools, providing they are supported 
by staff and accepted by peers. As evidence for the first condition there 
is the report from Hartford that bussed students who received staff support 
in their new schools showed greatest gains (Mahan, 1968). As evidence 
for the second condition, there are the findings of the U.S. Commission on
Civil Rights (1967) on the importance of interracial friendship to achieve- 
ment in an integrated setting. In this review, I have perforce ignored the 
growing and important literature on the relation of ethnic integration 
and self-concept on one side and of self-concept and achievement on the 
other. As Wilson (1967) and Pettigrew (1968) suggest, researchers must
assume a very complicated, two-way process in which the three variables 
ari Support by staff and acceptance by peers undoubtedly contribute 
o : 

In rapidly changing times the nature of variables and their interrelationship
may change. This review has revealed rather inconclusive
evidence of a relation between ethnic integration and achievement. But
the research examined refers to the immediate or distant past. The meaning
of integration may be changing, and the conditions under which it
is implemented can be made different in the future. One good reason that
there has been no adequate research to date on the effect of integration is 
that there have been no adequate real-life tests—no large-scale, long-run 
instances of top-quality schooling in segregated minority-group schools. 
Until our society tries such experiments, researchers will not be able to 
evaluate them. 
Columbia University


Although one of the traditional goals of education in the 
United States has been to prepare citizens for participation in a democratic 
society, American public education in large cities has been characterized by 
centralization, standardization, and professionalization which allow for 
little democratic participation. In general, the moves toward centraliza- 
tion in urban and rural areas have been progressive: centralization has 
provided uniform and equal educational opportunity, raised professional 
standards and created efficient and economical systems. In some instances, 
however, increased centralization resulted from political momentum rather 
than educational planning and from an unquestioning faith in the efficiency
of power accumulated at a single point.

Three questions are inherent in any evaluation of a centralized or 
decentralized political system: 1) to what extent are the primary needs 
and expressed wishes of clients of the system represented in the process? 
2) are the identification and involvement of the clients with the process 
advanced or retarded? and 3) is the system maximally efficient in accom- 
plishing its purpose? In education, goals are largely defined in terms of 
preparing individuals for functioning in a democratic society; thus the three 
questions are interrelated. 


Centralization and Participation in Urban School Systems 


In the late 1930’s and 1940’s Mort and Cornell (1941) conducted a 
number of studies on school administration and formulated an important 
measure of the effectiveness of school systems. From their research they
maintained that the educational quality of school districts could be mea- 
sured by their adaptability to change. Curricular innovations, new types 
of classes and classroom structures were among the variables indicating 
this capacity. Using adaptability as an index, Mort and Cornell found 
correlations between a district’s adaptability rating and such characteristics
as its financial policies, its size, and the degree of lay and professional
participation in the district. According to their studies, two-thirds of the
differences between adaptive and non-adaptive school districts could be 
ascertained without going into the schools. In a similar investigation 
which dealt with the problems of the big-city school, Mort and Vincent 
stated (1946, p. 88): 


Education in many ways is hampered in the large city, because 
here, as nowhere else among American schools, education is 
centrally controlled. It is as though the schools of your village 
were run by somebody way off at the state capital. You have 
no voice, no control, your questions go unanswered, your demands 
on the local administrator are parried by: “I’m sorry, but that 
matter is completely out of my hands; you will have to go to 
headquarters.” But you can never get close enough to the man 
at headquarters who makes the decisions, and you give up. 


After Cillie (1940) in his study of school organization related big- 
ness to inflexibility and powerlessness at all levels of the administrative 
structure, studies by Mort and Cornell on school administration pinpointed 
the maximum effective school district size as 100,000 pupils. This esti- 
mate was supported by Ross (1958) and by Leggett and Vincent (1947) 
in their study comparing New York City schools with other school systems. 

Since the evidence indicated that school district size could either limit 
or enhance the quality of education, investigators began to look into the 
possible relationship between district size and community initiative and 
control. Community control seemed to be a way of creating a more 
flexible and efficient system with greater potential for meeting the needs 
of individual communities. Hicks (1942, pp. 172, 174) hypothesized:


Adaptations initiated by the central office will be less well under- 
stood and less extensively developed than those which spring from 
within the community, i.e., they will not over a given period of 
time have reached (a) the degree of depth, or (b) the extent of 
ve comparable to those introduced through the force of initia- 
tive. 

When cities are comparable in size and expenditure, those promoting
the greatest extent of local freedom will rank highest in
adaptability, and their teachers highest in the understanding of
modern educational issues.


Westby (1947, p. 66) stated that local autonomy could neither be 
established nor assured by granting more power only to principals and 
superintendents: “the people of the community must have the power to 
make decisions that will have a real effect on the operations of schools and 
the means by which these decisions can be translated into action.” Jansen 
(1940) noted the absence of local initiative in certain areas of New York
City, particularly where apartment houses predominated, and suggested 
that the public school might provide a unifying force to stimulate initiative 
in other areas as well as in education. 

One of the most comprehensive and relevant investigations of the 
issue of district size and participation was conducted by Gittell (1965), 
who analyzed the role of the major decision-makers in the New York 
City public school system. She examined decision-making and administra- 
tion in five policy areas—budget, curriculum, selection of superintendents, 
salaries, and integration—and maintained that the public school system 
lacked channels for effective authority, even at the central level where 
all the power was ostensibly located. According to Gittell, the size of 
the system was a major difficulty: more than 3,000 persons were employed 
in the central bureaucracy and the operational field staff included 2,200 
principals and assistant principals, 31 district superintendents, and 740 
department chairmen. Over-centralization made any innovation nearly 
impossible to execute. 

As examples of the limitation imposed by size and overcentralization, 
Gittell analyzed the roles of some major “participating” bodies: 1) the 
Board of Education, until recently the most powerful body, operated 
largely to balance conflicting pressures and interests. In the five decision 
areas measured by Gittell, the Board’s role ranged from superficial par- 
ticipation (budget) to formulation of policies which it failed to execute 
(integration), to early negotiation followed by inability to carry through 
on responsibility (teachers’ salaries), to its most active role, the selection 
of a superintendent. 2) the superintendent of schools had nominal power 
in a number of areas, but because typically he had risen within the New 
York school hierarchy, he was unlikely to take a position different from 
that of the general bureaucracy. Constraints by the Board of Education 
limited him further in instituting change. 3) Local school boards had 
almost no authority in the determination of school policy; they usually 
played the role of community buffers, holding hearings and discussing 
narrow local issues. 4) District superintendents made no decisions re- 
garding the distribution of funds and had only limited discretion in as- 
signment of personnel; mainly they operated as buffers for those parent 
dissatisfactions unresolved by the school principal. 5) The teachers’ union 
kept to issues involving “professional interests” (job security, salaries, etc.) 
and avoided entering such areas as curriculum development or instruc- 
tional methods, where they were actually qualified. Local and civic groups 
such as the United Parents Association or the Parent-Teachers Association 
had little effect on decision-making; their potential power as pressure 
groups was usually lost because of the time and red tape involved in 
getting any action. Civil rights groups also lost most of their momentum 
through delays by the bureaucracy during attempts to integrate the
schools. 

Gittell’s description of the manner in which decisions were made about 
curriculum is particularly interesting. A small cadre of professional admin- 
istrators was responsible for curriculum development and change, which 
occurred every three years during a rushed period before books and ma- 
terials were ordered. The persons who knew how curriculum should be 
developed—curriculum specialists, principals and teachers—did not partic- 
ipate. Nor did parent or community groups have any voice in the decisions. 

Once a program was developed, curriculum coordinators and their 
assistants presented guidelines for its implementation to the principal, 
who was responsible for integrating it into the school’s existing curriculum 
and introducing it to his teachers. But, because of his limited ability in this 
area and his restricted time for teacher training, the principal typically 
transferred his responsibility to the teachers. In practice, however, even 
the teacher did not significantly expand or modify the guidelines. Thus, 
administrators external to the school and classroom formally initiated pol- 
icy and informally influenced its implementation. 

Several investigators documented the powerlessness which school prin- 
cipals, department heads, and teachers feel because they have no ma- 
chinery for making their needs known or for participating in decisions in 
which they are experts and which truly affect them (Becker, 1953; Chesler 
et al., 1963; Griffiths, 1963; Willower, 1963; Hornstein et al., 1968). As 
Becker (1953, p. 140) pointed out, although parents are probably less able
to effect change than those formally within the school system, their posi- 
tion outside the hierarchy makes them more threatening to school officials 
than if they had a legitimate channel for participation: 


To the teacher, then, the parent appears as an unpredictable and 
uncontrollable element, as a force which endangers and may even 
destroy the existing authority system over which she has some 
measure of control. For this reason teachers (and principals who 
abide by their expectations) carry on an essentially secretive rela- 
tionship vis-a-vis parents and the community, trying to prevent 
any event which will give these groups a permanent place of 
authority in the school situation. 


Organizational Size and Participation in Decision-Making 


Support for the correlation between the size of a group and opportuni- 
ties for participation, and in turn the satisfaction and performance of its 
members, comes from quite various experimental studies conducted in and 
outside of education. Most of these studies of group behavior are based on 
much smaller populations than even a medium-sized school system. For- 
tunately, despite problems in translating findings of experimental studies 
done with small samples to large populations, it is commonly agreed
that the processes of small groups can be used as models in understanding 
large organizations (Verba, 1961). 

Several investigators concluded that group satisfaction, morale, pro- 
duction, and participation are functions of the size of the group. Barker 
and Gump (1964) summarized the findings of industrial, commercial and 
social group research; some of the following studies were cited in their 
summary. Tallachi (1960) studied 93 industrial organizations and found 
negative correlations between organizational size and worker satisfaction 
and between satisfaction and absenteeism. Using various industrial and 
commercial organizations, the Acton Society Trust studies (1953) revealed 
the effects of size on a number of indices of worker morale. Interest in the 
affairs of the organization and knowledge of the names of administrators 
decreased as the size of the organization increased, as did voting on work 
unit issues, subscriptions to professional periodicals, output, and punctual- 
ity. Conversely, the investigators found that acceptance of rumors, 
absenteeism, accident rates, strikes, and waste increased as the size of the 
organization increased. A study of two automobile factories indicated a con- 
sistent negative relationship between size of the work unit and individual 
productivity or output (Marriot, 1949). Finally, in an investigation of 96 
business organizations, Indik (1961) found that size correlated positively 
with difficulty of maintaining communication among members and nega- 
tively with members’ participation. 

Although only the Acton Society Trust and Indik studies dealt directly 
with the relationship between size and participation, the latter is probably 
a dependent variable in all of them. Several additional investigators iden- 
tified correlations between size and participation. Barker and Gump 
(1964) discovered a consistent negative correlation between school size and
the level and type of participation of the students and a positive correlation
between participation and student morale. Identity crises, for example, were
far more prevalent in the large schools studied. Mort and Cornell (1941) found that
when school size is held constant, even district size can enhance or diminish 
participation of parent and teacher groups: 

In the United States, participation in decision-making is popularly 
accepted as inherently good. In the last thirty years scientific studies of 
small group situations have generally supported this attitude from a more 
objective standpoint. The participation hypothesis maintains that “. . . sig- 
nificant changes in human behavior can be brought about rapidly only 
if the persons who are expected to change participate in deciding what the 
change shall be and how it shall be made” (Simon, 1955). Lewin, Lippitt, 
and White (1939) conducted a series of experiments in which children and 
adults fulfilled different tasks under three leadership styles: democratic, 
laissez-faire, and authoritarian. The investigators found that members of 
democratic groups who were given an opportunity for participation in
decision-making were more satisfied and enthusiastic about the task and 
maintained a higher level of production than members of authoritarian 
groups. These findings were replicated in many similar studies in educa- 
tional and industrial psychology, in training programs in business and 
government, and in community planning. Verba (1961), in citing a num- 
ber of such studies, conjectured that under such situations of participatory 
decision-making members of the group identify with the task and are rein- 
forced directly by accomplishing it; their rewards come from rational 
decision-making in approaching the task as well as in greater productivity. 

The relevance of degree and legitimacy of participation for satisfaction 
and production is suggested in two other studies. To examine ways of 
effecting changes in the production methods in an industrial firm, Coch 
and French (1948) created three different work groups. In the control 
group, changes were introduced by management decision and the members 
of this group in no way influenced the change in the process of production; 
in a second group, the “partial participation” group, the changes were 
made by representatives selected by the group and in the third, the “total 
participation” group, all the members worked directly in making decisions 
about changes. Coch and French found that the production of the control 
group dropped after the changes were introduced and that they became 
hostile towards management; the partial participation group, however, 
continued to produce satisfactorily after a momentary drop in production, 
and the total participation group quickly exceeded its pre-change rate of 
production and remained satisfied with the job. This study was replicated 
in a Norwegian factory by French, Israel, and As (1960); the investigators 
found that production did not increase as a result of the workers having 
participated in decision-making. They attributed this to the fact that the 
decision in which the groups participated had little relevance to production. 
These findings suggest the need to distinguish between token and legiti- 
mate participation. They imply that participants must feel that their 
participation is meaningful and related to the immediate tasks. 

Although the preceding studies suggested the value of participation on 
morale and feelings of satisfaction, they concentrated on the necessity of 
participation for organizing change and maintaining production. However, 
research indicates that participation in decision-making enhances both 
the instrumental and the affective realms of human behavior in complex 
organizations. 

Flanders (1951) and Faw (1949) showed that in student-centered, 
democratic, or participatory classrooms, there was less hostility toward the 
teacher, less tension among the students, and sometimes greater actual 
learning. In a study by Flizak (1967) of the behavior and attitudes of 
teachers working in authoritarian, rationalistic, or humanitarian school 
structures, the degree to which the teacher was able to participate in school
decision-making was correlated with her interaction with her students. 
Teachers in the authoritarian school structures tended to be rated as dis- 
ciplinarians and information-givers; those in more rationalistic structures 
scored higher on motivation of students, and teachers in humanitarian 
school structures were viewed as fulfilling a counselor role. Obviously, 


indicate greatest satisfaction with their principal and school system when 
they perceive that they and their principals are mutually influential. 


an institution. When low-income parents were divided into groups accord- 


ahead means obtaining or providing a good education,” “Education comes 
to mind when they think of a good life for boys and for girls,” and “Edu- 


one can not conclude from these data either that people who are more 
interested in education are more likely to become involved in the public 
schools or that activity in school affairs increases interest in education. 
However, there is an element of logic and folk belief that would seem to 
support the latter assertion. 

Attitudes and behavior of many groups in large urban school systems 
indicate the degree of discontent which the existing structure has gen- 
erated. In the present systems, no group is small enough to participate 
meaningfully in decision-making. From the studies previously cited, one 
can hypothesize that if group size were decreased through decentralization, 
participation could be enhanced for parents, teachers, and administrators, 
thus creating more productive and more satisfied participants. 


Parent Involvement in Education and Pupil Development 


A child’s educational development depends upon a dynamic interac- 
tion between the parent and the school. Although this interaction gener- 
ally has been limited in the public school situation, several studies showed 
that even circumscribed participation by parents in school affairs correlates
with heightened pupil development.

In a study of the effects of contacts between parents and school per-
sonnel on student achievement, Schiff (1963) reported that parent partici-
pation and cooperation in school affairs lead to pupil achievement, better
school attendance and study habits, and fewer discipline problems. An 
analysis of the gains on a reading test which was administered to experi- 
mental and control groups of children revealed that pupils of the experi- 
mental (parent-participation) group improved significantly more than did 
pupils of the control group. 

From personal observations of compensatory programs in various parts 
of the country, Jablonsky (1968, p. 6) reported that “schools which have 
open doors to parents and community members have greater success in 
educating children. . . . The children seem to be direct beneficiaries of the 
change in perception on the part of their parents.” 

Hess and Shipman (1966, p. 35), in a study of the effects of mothers’ 
attitudes and behavior toward their children in test situations, concluded: 
“Engaging parents in the activities of the school in some meaningful way 
may indeed assist the child in developing more adequate and useful images 
of the school, of the teacher, and of the role of the pupil.” 

Rankin (1967) investigated the relationship between parent behavior 
and achievement of inner-city elementary school children and found sub-
stantial differences between the attitudes and behavior of mothers of high-
achievement and low-achievement children. The ability of the mothers to
discuss school matters and to initiate conferences with school officials were 
two of the general areas in which differences were most often found. 

Brookover et al. (1965) compared the development of three randomly- 
assigned low-achieving junior high school student groups: one group re- 
ceived weekly counseling sessions, the second had regular contacts with 
specialists in particular interest areas, and the parents of the third group 
had weekly meetings with school officials about their children’s develop- 
ment. At the end of the year the first two groups showed no greater 
achievement as a result of their special treatment. However, the third 
group, whose parents had become more intimately involved in the school 
and in their children’s development, showed heightened self-concept and 
made significant academic progress during the year. 

Parent involvement in the school not only is associated with attitudes 
and behavior but seems to influence teacher attitudes toward children. 
Rosenthal and Jacobson (1968) reported that children who profited from 
positive changes in teachers’ expectations of their ability all had parents 
who were involved to some degree in their child’s development in the 
school and who were distinctly visible to the teachers.* 


*Although Thorndike (1968) and Snow (1969) have questioned the data upon which 
Rosenthal based his conclusions, even Thorndike, whose criticisms are more encom- 
passing, tends to agree with the basic conclusions of the study. In this instance, ques- 
tions about the manner in which positive gains by students were determined, while 
technically relevant, do not significantly reduce the strength of Rosenthal’s conclusions.

Although parents are for the most part only marginally involved in
the schools, in Rough Rock, Arizona, the Navaho Indian parents and 
community have votes on all matters of school policy and sit on local 
school boards. Before parents and other community members obtained 
control of the schools, the Bureau of Indian Affairs had tried unsuccess- 
fully for many years to increase school achievement and lower the dropout 
rate. Reports from Rough Rock indicated that involvement of parents in 
the process of education triggered student enthusiasm for learning, largely 
by making the school an integral part of the community and recognizing 
the importance of native Indian culture (Roessel, 1968). 

These investigations suggest several possible hypotheses about the 
manner in which parent involvement affects pupil development, particu-
larly academic achievement. The parent's participation may make him
more visible to school personnel and to his child, which may indicate to 
both that educational values are upheld by the family. Parent participa- 
tion at the same time may change the attitudes of the parents toward the 
schools and toward the goals of education. And, as studies such as those
by Hess and Shipman indicate, when parents are involved in the process 
of education, they may come to acquire certain skills of teaching which can 


The active participation of parents in school affairs and other com- 
munity and political activities may also enhance cultural identity and self- 
concept, which in turn raise achievement. Stating what many other investi- 
gators have felt, Chilman (1966) noted that the parental patterns most 
characteristic of the very poor are an anticipation of failure and a distrust 
of middle-class institutions such as schools. Youth in the Ghetto, the clas- 
sic study of life in central Harlem, documents that children growing up in 
the inner city sense almost immediately their parents’ feelings of power- 
lessness and quickly assume that they too have little or no control over 
their fate (HARYOU-ACT, 1964). In an analysis of the political social-
ization of blacks, Seasholes (1969, p. 58) wrote:


In the end, the most serious consequence of Negro frustration,
and noninvolvement in politics is the possibly deleterious effect on
the Negro’s own evaluation of himself. The Negro who sees poli-
tics as a conspiracy against him may or may not have a low politi-
cal self-image. The Negro who traces his political insignificance
to his own shortcomings does: “They don’t care because I am
worthless.” (Seasholes’s emphasis)

In Equality of Educational Opportunity, the largest study of achievement
among minority group children, Coleman et al. (1966) concluded that the 
child’s sense of control over his environment is one of the strongest factors 
influencing his achievement. The authors suggested that for children from 
disadvantaged groups, achievement appears to be influenced by what they
believe about their environment: whether they believe it will respond to 
reasonable efforts or whether it is instead immovable or merely random. 
The child’s sense of control over his environment may be more important 
to achievement than school characteristics. According to the study, al- 
though in important ways the attitudes of black and white students toward 
school and academic work differ little, the black students are less likely to 
expect that they will go to college or even obtain a job that will require 
advanced education. Black children have considerably less confidence in 
their ability to control their environment than do white children. They 
tend to agree with such statements as: “People like me don’t have much of 
a chance to be successful in life”; “Every time I try to get ahead, some- 
thing or somebody stops me”; or “Good luck is more important than hard 
work for success.” 

The Cloward and Jones study of low-income, working-class families
on the Lower East Side of New York City confirms this finding of the ef- 
fect of minority status and social class on the belief that work and educa- 
tion will result in getting ahead. Although middle-class parents tended to 
believe that schooling and hard work resulted in success, low-income par- 
ents felt that success was largely related to “whom you know” or “luck.” 
However, it is important to note that Cloward and Jones stressed that 
parents of all classes who were involved in the schools were likely to be- 
lieve that the school and education could actually effect change in their 
children. Their participation in the school may have given them a greater
sense of fate control than those parents who were not involved in school
matters.


The sense of control of one’s destiny is only one of a number of affec-
tive variables which have been found to significantly influence develop-
ment. Other related variables—self-esteem, motivation, level of aspira-
tion, peer relationships, teacher attitudes, and the general school and home
environments—also are acknowledged as important in the child’s develop-
ment.


Throughout the twentieth century, educators have attributed varying 
degrees of importance to these affective variables. Because of the influence 
of Dewey early in the century, educators felt that schools could best teach 
children by developing them emotionally and socially. This emphasis was 
shifted after the Sputnik crisis in the 1950’s, which created pressure in 
American education to rapidly produce students with highly sophisticated 
and specialized intellectual achievement and cognitive skills. However, in 
the last few years several investigators have become somewhat skeptical 
about the possibility of significantly influencing performance through 
changes in basic cognitive processes. They consider it likely that cognitive 
intervention can not promote significant changes in the quality of the 
child’s intellectual functioning without changes in his affective processes—
aspiration, commitment, motivations, attitudes. Zigler (1966) suggested 
that the affective areas of development may be far more amenable to 
change than the cognitive areas and that when significant changes in the 
quality of intellectual development occur, they may be more related to 
prior changes in the affective domain than to cognitive intervention. Thus, 
he and others tended to deemphasize the need to create new learning de- 
vices and focused on changing the learning environment and improving the 
relationship among school, family, community, and ethnic reference 
groups. 

An affective area which shows potential for enhancing the perform-
ance of low-income and minority group children is the improved self-con-
cept resulting from active parent participation in the school. It is now felt
that parent involvement can integrate the child’s school and home life and 
provide him with a model of participation and control in a major area of 
his life. 


Community Identity and Educational Achievement 


Since the 1954 Supreme Court school desegregation decision, educators 
have focused on changes in school ethnic composition as one means of 
creating quality education for minority group children. Communities have
responded to the demands of the courts for desegregation with a series of 
plans and programs: open enrollment, bussing, rezoning, school site selec- 
tion, and school construction (including educational parks and complexes). 
Most of these plans have been partially achieved, at best. Bussing, for 
example, has generally resulted in a one-way flow out of the ghetto school 
and into the middle-class white school with little or no reciprocation and 
relatively little integration within the white school. St. John (1968) 
pointed out that “resegregation” has occurred through tracking as well as 
through the white exodus from the cities to the suburbs. In large urban 
areas, the departure of white families has left the inner-city schools with 
large proportions of youngsters from black and Puerto Rican homes. The 
possibility of instituting any meaningful degree of school integration is 
becoming unlikely, particularly in the absence of enthusiasm for metro- 
politan as opposed to city-bound school districts. 

Recognizing the ineffectiveness of past efforts to integrate the schools, 
both educators and minority group parents now accept that the neighbor- 
hood school will continue to exist and may even have intrinsic value. 
Thus, those concerned with quality education emphasize the importance 
of strengthening the integrity of the neighborhood school and the com- 
munity it serves. School integration as a priority has been put aside, at 
least for the moment. 

Such a decision may appear regressive, considering the number of 
studies which showed educational achievement to be higher in integrated
than in segregated schools. However, as St. John points out in chapter 5 
in this issue, studies on the effects of ethnic integration on school achieve- 
ment are far from conclusive. Moreover, there are confounding variables 
which indicate that the integrated school may not be the only setting in 
which the achievement of minority group children can be raised. For ex- 
ample, the finding of Coleman and others that the black child’s sense of 
control over his fate was greater in the integrated environment may show 
that those children who attend integrated schools had parents who actively 
worked towards achieving integrated education for their children, and 
thereby acted as models for fate control. Achievement in integrated schools 
is likely to have been enhanced because these schools, being in better 
neighborhoods, are also generally better equipped and have staff who 
themselves do not feel “deprived” by their working environments. Finally, 
black children attending integrated schools are generally from families of 
higher socioeconomic status than those in segregated schools. 


A number of studies suggest that although largely white schools with 
a small proportion of minority groups may offer the best conditions for 
producing school achievement, the value of community and group integ- 
rity has been severely underplayed. Data from the U.S. Civil Rights 
Commission (1967) study, Racial Isolation in the Public Schools, show 
that although achievement was greatest in the predominantly white inte- 
grated schools, students in 90% “segregated” schools in black neighbor- 
hoods had higher achievement scores than those attending schools with 
an approximately 50-50 ethnic composition. 


A study of juvenile delinquency rates in various tracts in Baltimore 
sheds light on the effects of community integrity from another vantage 
point (Lander, 1954). Controlling for economic factors identified by such
variables as deteriorated housing, low rentals, and overcrowding, Lander
found that delinquency rates increased as the proportion of blacks in a 
neighborhood went from 8% to 50%. However, as the black population 
increased beyond 50%, black delinquency rates tended to decrease with 
the areas of 90% or more having the lowest rates. According to Lander, 
delinquency in Baltimore is fundamentally related to the anomie of a 
neighborhood and conversely a lack of delinquency is related to neighbor- 
hood stability and identity. 

From an analysis of the academic achievement of Catholic children, 
Greeley and Rossi (1966) concluded that “religio-ethnic” identity provided 
by the ghetto atmosphere of the Catholic schools is an important correlate 
of student performance. The educational achievement and later job suc- 
cess of Catholics who attended parochial schools compared favorably with 
that of other Catholics who attended the best public schools in the country. 
This high achievement can be attributed to the dedication of the Catholic 
teacher; the students may also be motivated to achieve through the group
identification and pride which the school encourages. The authors main- 
tained that their findings “call into serious question the assumption that 
it is necessary, for the health of society, that the religious and religio- 
ethnic ghettos be eliminated.” Greeley and Rossi hypothesize that the 
identity provided by these ghettos may work not only to promote achieve- 
ment, but also to further the Catholic child’s general acceptance of indi- 
viduals from various ethnic groups: “In the long run, [the ghettos] may 
even promote greater tolerance, because they give a person a relatively 
secure social location and a fairly clear answer to the difficult question, 
‘Who am I?’”

A report on the “Education of American Indian Children” summarized 
research pointing to the need for self-determination and self-sufficiency 
of the American Indian which creates the psychological well-being neces- 
sary for successful learning (Gaarder, 1967). The author’s recommenda- 
tions are based on 


. . . the principle of self-determination (including the choice of 
language) and the belief that the only road of development of a 
people is that of self-development, including the right to make its 
own decisions and its own poems and stories, revere its own 
gods and heroes, choose its leaders and depose them—in short, to 
be human its own way and demand respect for that way. 


By the time a child enters school he has already developed an indi- 
vidual and cultural identity; for minority group and low-income children, 
this identity has been viewed as a disadvantage. One of the reasons why 
these children have been considered “culturally” or “educationally dis- 
advantaged” is that the schools have been less successful in educating them 
than they have in educating middle-class children. Educators have as- 
sumed that one instructional system could be applied to all children— 
with not even the tribute of respect paid to minority cultures—and that 
the success of all children could only be measured in terms of their adapta- 
bility to the uniform standards implicit in this system. The inability of 
the “disadvantaged” students to profit from even such special arrangements 
as the various compensatory education programs may be due to the actual 
irrelevance which the curriculum and instruction has had to their lives 
as well as to the alienation of these children and their parents from the 
procedures of the school. 

Thus, for the schools to be most effective, change may be needed in 
the school and in the relationship between the school and the community; 
education will probably have to become more relevant to the students, and
community and cultural integrity will have to be recognized. Community- 
originated and community-controlled education provides one means of 
effecting needed change. Local control should enable the communities to
identify their special needs and to legitimately change the schools to meet 
them. 


Conclusion 


Although the democratic tradition of the United States presupposes 
that citizens will actively participate in political decision-making, political 
and administrative momentum has often led to increased centralization 
of power, varying degrees of representation rather than participation, and 
the alienation of citizens from decisions which affect their lives. In educa- 
tion, the rise of big-city school systems has widened the gulf between 
decision-makers and those affected by the decisions, and many school 
systems are now too large to sensitively administer to the needs of their 
clients. In New York City, particularly, the social and political distance 
between the growing population of black and Puerto Rican families and 
the educational decision-makers has shown the shortcomings of a highly 
centralized bureaucratic decision-making process. These groups feel they 
have little access to power in educational and other social-political insti- 
tutions, and since they have found the public school ineffective in fulfilling 
their needs, they have become unwilling and at times hostile second-class 
participants in society. 

Investigations of the effects of participatory decision-making in creating 
positive changes in the affective and instrumental behavior of the partici- 
pants consistently demonstrate the importance of actively involving indi- 
viduals in decisions which affect them. Educational research indicates 
that when parents of school children are involved in the process of educa- 
tion, their children are likely to achieve better. This heightened achieve- 
ment may be due to the lessening of distance between the goals of the 
schools and the goals of the home and to the positive changes in teachers’ 
attitudes resulting from their greater sense of accountability when the 
parents of their students are visible in the schools. The child may also 
achieve better because he has an increased sense of control over his own 
destiny when he sees his parents actively engaged in decision-making in 
his school. Moreover, from the heightened community integrity and ethnic 
group self-esteem, which can be enhanced through parent and community 
groups effecting educational changes, the child will have a greater sense 
of his own worth, which is essential if he is to achieve.


As used in higher education, the term disadvantaged is vague
and increasingly unacceptable to those deemed disadvantaged by others. 
It remains, however, the term generally used to designate groups of 
students from ethnic or socio-economic backgrounds that have in the 
past been underrepresented in American colleges and universities. Prac- 
tically, the term is now used most to describe students, regardless of 
financial or social circumstances, who are Afro-American, Mexican- 
American, Puerto Rican, or American Indian. As a matter of rhetoric, 
the term includes white students from families that are both poor and 
isolated from the middle class. Actually, the term is almost always used 
to refer to students who can be grouped in some simple way and, except 
for occasional references to Appalachia, white students are not prominent 
in programs avowedly for disadvantaged students. Ideological movements
involving “third world” coalitions sometimes include Orientals, but the 
rate of college attendance of Orientals is apparently very high and their 
educational achievement is approximately the same as that of the general


For the purpose of this chapter, we shall follow the customary usage 
and consider the disadvantaged to be members of groups that have his- 


are clearly below national averages on economic and educational indices. 
Much of the literature is concerned with black Americans. 

This is a particularly awkward time to review this research. Whether 
we consider that concern for expansion of educational opportunity can 
be traced to the academies of colonial times or that problems of exclusion 
and denial of access have just been discovered, it is clear that the ad- 
mission of large numbers of disadvantaged youths to colleges has been a 
matter of high priority for a very short time. 

*Support for the preparation of this manuscript was supplied in part by a grant from
the Ford Foundation.




REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 1


Five or six colleges have a long history of concern for black youths, 
but a substantial effort to increase enrollment in nominally unsegre- 
gated colleges probably did not begin earlier than the founding of the 
National Scholarship Service and Fund for Negro Students in 1949 (The 
Fund, 1956). During the 1950’s the United States was preoccupied with 
the desegregation of public schools, especially after the Supreme Court 
decision in Brown vs. Board of Education in May 1954. Gordon and 
Wilkerson (1966) considered the literature of higher education to be 
barren of attention to the problem before 1960. In 1964, they asked
2,093 higher institutions to report any special programs and practices to
help disadvantaged students. Only 610 institutions responded, and only 
224 of these reported any special program or practice. Considering the 
difficulties of answering and asking questions about the existence of de- 
sirable practices, perhaps the most that can be said is that by the early 
1960’s at least 10% of America’s higher education institutions were suf- 
ficiently aware of the disadvantaged to claim some special activity. 

It was not until the appearance of the Coleman Report, Equality of 
Educational Opportunity (1966), that there were substantial data con- 
cerning the extent of racial segregation in higher education. These data, 
based upon enrollments in 1965-66, indicated that America had one set of 
colleges that was about 98% black and another set that was about 98%
white. 

It is impossible to say at what rate higher education might have 
developed a sense of urgency about the enrollment of minority youth in 
the normal course of events, for there was to be no normal course of 
events. The assassination of Martin Luther King, Jr. in April 1968 pre- 
cipitated a crisis of conscience and physical confrontation between college 
administrators and militant black youth. This crisis was in preparation 
for many years, but it occurred in the spring and summer of 1968 and 
established the academic year 1968-69 as the time when most institutions 
moved the problems of the disadvantaged near the top of their lists of
urgent problems. 

One consequence of this chronology is that in the summer of 1969 
a number of substantial studies were in progress or were completed but 
not reported. Thus, when this Review appears or within a few months 
thereafter, the research literature will contain a number of important
items which can not be reported in this chapter. A second difficulty asso- 
ciated with the timing of this report involves how the basic problems are 
defined. In the past considerable attention was given to talent search 
projects, to studies of conventional tests, or to remedial courses; all were 
designed either to find or create conventional college students from dis- 
advantaged populations. It is now much more generally recognized that, 
as occasionally noted in earlier times (Eels, 1953; Gordon and Wilkerson,
1966), the more central problem is the reconstruction of the educational
system to accommodate the population, rather than vice versa. American 
higher education, historically heterogeneous but usually designed for some 
selected population, is now asked to provide a useful experience for most 
young people, including those who can not afford to pay the bills, are 
not “prepared for college,” do not have “college ability,” and do not 
arise from the backgrounds that have provided even the self-made men 
of earlier times (McGrath, 1966). This does not make the research re- 
ported here irrelevant, but it pushes much of it to the side to make way 
for questions of purpose and organization that will generate important 
research in the future. 

Finally, no orderly account of research is now or will be possible as 
long as the crisis in values remains at its present pitch. The literature 
is immense—a recently issued bibliography on “school integration,” in- 
cluding much material on the transition from school to college, contains 
8,100 references (Integrated Education, 1969), most of recent date. Most 
of this literature is polemical, and almost all of it is based upon arguable 
and unsettled assumptions concerning such matters as the purpose of insti- 
tutions, the proper organization of society, and the best relation between 
study and action. 

The appropriateness of research as an approach to solving social prob- 
lems is under attack, partly because studies, demonstrations, projects, and 
reports have seldom been connected to dramatic institutional change. But 
confrontations have, in at least a few cases, been visibly connected to 
the appearance of change. It may be that research and confrontation tend 
to be their own rewards, but in each case only to those who do the 
research or make the confrontation. Those who have either faith or a 
stake in the proposition that the collection and analysis of data is a fruit- 
ful way to spend scarce resources in connection with disadvantaged youth 
must pursue with some energy the clarification of goals without which
their ordinary work of data collection and analysis does not seem to have
much point.


Educational Attainment and College Attendance


The best available data concerning educational attainment and col- 
lege attendance by disadvantaged students are for Afro-Americans, who 
are the largest and most frequently studied disadvantaged population, 
excluding always underclass whites who have not yet had much attention. 
For the United States as a whole, estimates in recent years have been 
that black students comprise between 5-71% of the total college enroll- 
ment (Coleman, 1966; Astin, Bayer and Baruch, 1968). By the mid-1960’s 
slightly less than half of all these students were in colleges identifiable 
as “predominantly Negro.” This was a substantial decrease since 1950 
when about two-thirds of such students were in the predominantly Negro 
colleges (Jaffe, Adams, and Meyers, 1968). Even so, enrollments in the 
Negro institutions increased 21% in the two year period 1963-64 to 1965- 
66. This was very close to the national increase for all higher institutions 
(Commission on Higher Educational Opportunity in the South, 1967). 
It is difficult to be precise even about enrollment in the predominantly 
Negro colleges. Different studies use different lists of institutions and at 
least one new institution opened as a de facto Negro college without any 
clear intention of being such (Federal City College, 1968). 

Enrollment of Afro-American students in predominantly white colleges 
was impossible to estimate until recently and may well become impossible 
again. The first substantial data available were given by Coleman et al. 
(1966) and were based on estimates made by officials of institutions in 
connection with the Opening Fall Enrollment Survey of 1965. Only 92% of 
institutions responded and the estimates were of unknown accuracy. 
Individual institutions were not identified. Even so, these figures were 
extremely valuable since, even with generous allowances for error, they 
documented the extreme segregation existing in higher education. 

In 1967 and 1968 the Civil Rights Office of the Department of Health, 
Education and Welfare required colleges to file estimates of enrollments, 
classified by ethnic group, as evidence of compliance with the Civil Rights 
Act of 1964. These estimates were published for each reporting college 
in The Chronicle of Higher Education (1968, 1969). Considerable num- 
bers of institutions were not included in the published reports, although 
it is not clear that such institutions actually failed to certify their com- 
pliance with the Civil Rights Act. Users of the tables have found absurd 
entries, and the published report for 1968-69 is accompanied by an asser- 
tion that the data are unreliable. Even so, much of the data is apparently 
accurate and these reports make possible studies of the ethnic distribution 
of students in state systems or the higher institutions of particular metro- 
politan areas—matters of much importance. These data will no longer be 
collected by the Department. No doubt, there were many policy considera- 
tions involved in this decision, but certainly the research community would 
have been well served by a decision to improve enforcement and data 
collection rather than abandon the project. 

For the nation as a whole, some of the most fundamental statistics 
have to do with the rapid increase during the 1960’s of high school grad- 
uation for non-whites. During the period 1960-66, median years of school 
completed for non-white persons 25 to 29 years old increased from 10.5 
to 12.1 for males, and from 11.1 to 11.9 for females. During that same 
period, the per cent of non-white males completing four years of high 
school increased from 36 to 53, while for females the increase was from 
41 to 49 (Bureau of the Census, 1968). These figures are of great 
importance to higher education, for the high school graduate defines the 
population eligible to enter college. 


In the Negro population 25 to 29 years of age, the percentage of 
those completing four years of college or more increased from 4.3 to 6.8 
in the period 1960-65. During the same period white college graduates 
in the same group increased from 11.7 to 13.7% (Bureau of the Census, 
1968). Presumably, however, the rapid increase in non-white high school 
graduation during that period will be reflected in the Negro college grad- 
uation rates of the late 1960’s and early 1970’s. 


The figures above are notable in that non-white males had sub- 
stantially lower educational achievement than females in 1960 but had 
higher attainment by 1966. This is also reflected in the college graduation 
figures where females exceeded males 4.6% to 3.9% in 1960, but males 
led 7.4% to 6.4% by 1965 (Bureau of the Census, 1968). 
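
The attainment figures quoted above can be compared more directly by computing the absolute and relative change for each group. The following is only an illustrative sketch: the numbers are those cited in the text from the Bureau of the Census (1968), while the dictionary layout, the labels, and the helper name `change` are assumptions introduced here. The high school rows assume, as the context indicates, that the "four years of school" figures refer to high school.

```python
# Figures quoted in the text above (Bureau of the Census, 1968).
# Each entry maps a label to (value in 1960, value in the later year).
# The labels and the dictionary layout are illustrative choices only.
FIGURES = {
    "median years of school, non-white males (1960-66)": (10.5, 12.1),
    "median years of school, non-white females (1960-66)": (11.1, 11.9),
    "% completing four years of high school, males (1960-66)": (36, 53),
    "% completing four years of high school, females (1960-66)": (41, 49),
    "% completing four years of college, males (1960-65)": (3.9, 7.4),
    "% completing four years of college, females (1960-65)": (4.6, 6.4),
}


def change(start, end):
    """Return (absolute change, change as a per cent of the starting value)."""
    absolute = round(end - start, 1)
    relative = round(100 * (end - start) / start, 1)
    return absolute, relative


for label, (start, end) in FIGURES.items():
    absolute, relative = change(start, end)
    print(f"{label}: {start} -> {end} "
          f"(change {absolute:+}, {relative:+}% of the 1960 level)")
```

On these figures the male gain in high school completion (17 points) is roughly twice the female gain (8 points), which is the reversal the paragraph describes.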


Although data on black students are inadequate, the situation for 
other ethnic groups is chaotic. Mexican-Americans constitute the second 
largest disadvantaged minority but they, with the Puerto Ricans, can not 
be enumerated except through the awkward device of the “Spanish sur- 
name.” Grebler (1965) gave an example of use of this device, working 
from census data. A minority study of this population was completed by 
a research group under the direction of Grebler at UCLA. The central 
report of this study is in press (Free Press, Glencoe, Ill.) as is a specific 
study of Mexican-American education by Thomas Carter of the University 
of Texas, El Paso (College Entrance Examination Board). These reports 
will be published in 1969-70. 


In general, data on the economic and educational status of Mexican- 
Americans, Puerto Ricans and Indian Americans suggest that these popula- 
tions are at least as disadvantaged as Afro-Americans (Coleman et al., 
1966), but there are enormous local variations. For example, the 
educational attainment of Mexican-Americans in Texas alone varied 
in 1960 from 8.7 grades in Beaumont to 3.9 grades in Brownsville (Grebler, 
1967). This is like local fluctuations in educational attainments of Afro- 
Americans. In 1960, for example, Mississippi with more than 900,000 black 
citizens produced 15,000 black high school graduates, while Florida with 
880,000 black citizens had more than 40,000 high school graduates (Jaffe, 
Adams and Meyers, 1968). Again, it is important to emphasize that the 
educational opportunity available to disadvantaged students who lack the 
financial resources and sophistication to command the facilities of the 
nation as a whole is extremely dependent upon local circumstances. As 
several researchers have shown, the establishment of a junior college 
where there has been none can affect poor students to a very great degree, 
as can local variations in financial aid policies or in the conduct of lower 
schools (Koos, 1944; Medsker and Trent, 1965; Bashaw, 1965; Willingham, 
1969). 


Guidance and the Search for Talent 


During the 1950's a considerable amount of attention was given to 
the need for finding and developing America’s human resources. Much 
of this was from a “manpower” point of view; that is, research was con- 
ducted and reports issued demonstrating the loss to society resulting from 
an inefficient system of talent development. A National Manpower Council 
was established at Columbia University in 1951; it proceeded with a series 
of conferences and reports on national manpower requirements and 
problems (National Manpower Council, 1954). National studies of the loss 
of talent through inefficiencies in the social system, including particularly 
education, were made by Wolfle (1954), Berdie (1954), and Cole (1956). 
The National Merit Scholarship program was established to select talented 
youth for scholarships that would enable them to attend college (National 
Merit Scholarship Corporation, 1955). Colleges organized their own 
financial aid programs more efficiently than before and developed the 
principle that financial aid should be based upon need, as well as upon 
talent, to conserve financial resources for talent development (College 
Entrance Examination Board, 1956). Some attention in these reports was 
given to populations that are now called “disadvantaged.” 

One of the earliest major attempts to develop the talent of disad- 
vantaged junior high school and high school students was the Demon- 
stration Guidance Project carried on in New York City from 1956 to 1962. 
This project was organized as a demonstration rather than as closely con- 
trolled research. But a detailed assessment of the project made it appear 
that a determined effort, with strong financial support, to improve the 
instructional and guidance services available to a disadvantaged urban 
population resulted in a substantial increase in the number of such youth 
going to college (Wrightstone et al., 1963). This conclusion was con- 
sidered to be of major importance at the time, but financial support was 
not available for the continuation of the program. 

By the end of the 1950’s a number of scholars had begun to question 
the definition of “talent” as formal academic ability to the exclusion of 
social, entrepreneurial, and creative abilities not perfectly correlated with 
scholastic aptitude. A major statement of the position was made by 
McClelland, Baldwin, Bronfenbrenner and Strodtbeck (1958). Nevertheless, 
such attention as was given to disadvantaged populations continued to 
emphasize: (1) the discovery of talent among youth who were being denied 
access to higher education by financial circumstances, racial discrimination, 
lack of motivation, or inadequate guidance; and (2) the development of 
talent through improved instruction that might create conventionally able 
college students from populations that were educationally undernourished. 
It would be incorrect to say that these approaches to the disadvantaged 
are in total eclipse, but certainly they are not in the dominant position 
they held in the previous decade. 


As early as 1953, Eells, discussing cultural bias in intelligence tests, 
declared that such tests are adequate measures of “scholastic aptitude” 
as long as schools remain designed for the white middle class. He called 
for radical revision in educational programs rather than for attempts to 
develop conventional ability in these populations. This was later sub- 
stantially the position of Gordon and Wilkerson (1966). 


Coleman produced data showing astonishingly high apparent intention 
to attend college among black youth in the mid-1960’s. In metropolitan 
areas of the Western States (to take the extreme case), 85% of Negro 
youth in the study said they either definitely or probably would go on to 
college in the following year. 


Financial problems are still reported by students as major reasons 
for not continuing education beyond secondary school (Tillery, Donovan, 
and Sherman, 1969; Knoell, 1968)—an opinion which can scarcely be 
doubted by anyone familiar with the financial responsibilities and burdens 
of disadvantaged youth. Nevertheless, circumstances have changed since 
the 1950's. Jaffe and Adams (1969) reported that between 1959 and 1965 
the intention to go to college increased by 6% among students from 
affluent families, but the rise for poor students was 25%. Johnson and 
Reed (1969) reported that 35% of college families have incomes below 
the national median. Willingham (1968), reviewing this and other evi- 
dence, concluded that “expansion of educational opportunity necessarily 
involves expanded financial support” and emphasized the importance of 
the establishment of low-cost community institutions. The importance of 
the availability of appropriate local institutions is emphasized by the 
finding of Jaffe, Adams and Meyers (1968) in studying students of 
predominantly Negro colleges, that median family income was only $2,696 
for students in Southern two-year Negro institutions. 

The change of context in which financial problems of disadvantaged 
students are viewed is from the idea of financial aid for talented students 
to the idea of access to higher education for most, if not all, youth. But 
if compensatory education, motivation, and financial support are seen in 
a somewhat different way now, the popularity of talent searching is very 
much reduced, although perhaps it should not be. Earlier expectations 
that disadvantaged populations are rich in conventional scholastic ability 
have been disappointed. Coleman et al. (1966) showed low performance 
on educational tests for Afro-American, Mexican-American, Puerto Rican 
and American Indian students. Kendrick (1967) pointed out that colleges 
will remain segregated racially if they confine their efforts to discovering 
talented black students resembling the white students already enrolled. 

The possibility of increasing basic scholastic ability through com- 
pensatory programs has also been questioned. Bloom (1964) emphasized 
the critical importance of the preschool years in the development of cog- 
nitive ability. Shaycoft (1967) reported no discernible effects on verbal 
ability from differences in high school programs over four years of in- 
struction. Jensen (1969) announced that compensatory education has 
failed; he argued for the importance of genetic rather than environmental 
factors in intelligence and scholastic achievement. However, he argued 
further that “one of the great and relatively untapped reservoirs of mental 
ability in the disadvantaged, it appears from our research, is the basic 
ability to learn.” He proposed that schools find new ways to utilize the 
strengths of children whose major strength is not conventional scholastic 
aptitude. At this writing, Jensen’s position has been criticized by, among 
others, Kagan, Hunt, Crow, Bereiter, Elkind, Cronbach and Brazziel 
(Harvard Educational Review, 1969). This controversy is likely to be one of 
the livelier ones in psychology during the next few years. However, the 
central theoretical issue is an ancient and classic one in psychology, and 
Jensen’s general position with respect to educational practice is much like 
that of Gordon and Wilkerson (1966) and Eells (1953) cited earlier, 
although probably much different in detail. The question not answered 
by anyone is precisely what talents require what program to what ends. 

The question remains whether there are now significant numbers of 
talented students in disadvantaged populations who do not have access 
to higher education. There is strangely little evidence on this point, 
although it would seem both important and comparatively simple, con- 
sidering the amount of testing that is done, for school systems to find out 
and report in detail what is happening to poor and minority youth who 
have done well in school or have good scores on tests. 

A study now being conducted at the College Entrance Examination 
Board and the Educational Testing Service, and a somewhat different 
study at the American Association of Junior Colleges may soon provide 
some evidence as to the existence and disposition of talented minority 
youth in nine or ten major cities. But we have not found reports sug- 
gesting that school systems have reliable information about the educational 
and occupational careers of their disadvantaged graduates and dropouts. 
Willingham (1969) cited early state studies by Berdie (1964) and Little 
(1959), but remarked that “few states have anything remotely resembling 
the sort of information on their high school seniors which would permit 
and encourage a thorough analysis of opportunity for post-secondary 
education and its outcomes.” Continued development of Project TALENT 
(Flanagan, et al., 1962), Project SCOPE (Tillery, Donovan, and Sherman, 
1969), and studies by Astin and others at the American Council on 
Education (Astin, Panos, and Creager, 1967) provide some data. 


It is true that the idea of the talent search as a method of singling 
out a few gifted minority students is no longer acceptable in the black 
community (Lane, 1969), or probably in other organized minority move- 
ments. The idea is certainly discredited that national educational strategy 
in a period of developing universality in higher education for whites can 
be based upon talent searching in minority communities only. But, in 
the absence of detailed information, there is the possibility that artificial 
impediments to opportunity for potentially talented youth will be ignored 
while public policy of the future, like that of the past, will be based upon 
the assumption that all black and brown students are exactly alike. If 
access to higher education for the disadvantaged turns out to mean routine 
assignment to the “general” curricula of comprehensive colleges, just as 
universal secondary education has often meant routine assignment to the 
“general” curriculum of comprehensive high schools, it is difficult to see 
what will have been gained. 


Manning (1968) reviewed the difficulties of using existing testing 
programs with disadvantaged students; he called for the redirection of 
testing at the point of transition from school to college to emphasize 
diagnosis and to improve the distributive and evaluative functions of 
educational systems. Both the American College Testing Program (1966) 
and the College Entrance Examination Board (1968) recently introduced 
new programs or redirected interpretive materials toward broader measure- 
ment for the purpose of junior colleges and other “open door” institutions. 
These are very faint and even primitive beginnings, but they may revive 
the idea of the talent search in a totally new context. 


The great mystery of the literature pertaining to disadvantaged popu- 
lations is in the role of school guidance programs. The encouragement 
and assistance of disadvantaged students in aspiring to college and suc- 
cessfully overcoming the many difficulties in transition is widely believed 
to be one of the functions of guidance programs. However, in the most 
recent Review of Educational Research on the subject of guidance and 
counseling (April 1969), there was no mention of disadvantaged students 
except in a brief discussion of programs in higher education, and three 
citations under “counseling students with special problems.” Island, 
author of the latter chapter, noted that “research on counseling black 
students in public schools has begun to appear in the literature.” Certainly, 
the remarkable number and variety of college placement programs for 
disadvantaged youth that have developed outside the schools, with or 
without federal support (Educational Talent Search Program, 1968) 
suggests that there is a crisis in guidance services in the schools. If so, this 
crisis is not reflected in the research literature or in articles appearing, for 
example, in the Personnel and Guidance Journal. 

Coleman et al. (1966) studied certain aspects of guidance services by 
questionnaires directed to the schools in connection with the massive study, 
Equality of Educational Opportunity. It was found that black pupils have 
slightly more access to counselors, as judged by counselor-pupil ratios 
(but counselor load is still high). Outside the South, counselors of black 
youth were usually white (80%). For black students, especially girls, the 
fit between aspiration and ability was much better in schools where there 
were black counselors. These data are far from conclusive—this is a very 
small part of the report—but they do represent an approach to providing 
data about matters of importance that are much discussed but little studied. 

Although we found little appropriate literature, we think it is im- 
portant to say here that in our experience, members of disadvantaged 
minorities are extraordinarily bitter about failures both of omission and 
commission by guidance personnel. The usual complaint is that guidance 
officers in the schools do not “relate” to minority youth, that these 
youth are assigned to vocational curricula en masse on the basis of ethnic 
membership, and that they are not properly encouraged and assisted in 
seeking higher education. Regardless of the merit of these criticisms— 
and certainly no simple set of charges can properly be applied to the 
large and diverse guidance profession in the United States—it is extremely 
serious that there is so little information about the nature and effective- 
ness of actual guidance programs in schools that enroll large numbers of 
disadvantaged youth. Enormous sums of money are expended in support 
of the schools and no external agency can hope to achieve the efficiency 
in assisting the student that is possible in the school where he spends 
most of his time. It is a major failure of educational research that this 
central process in the transition from school to college should be so little 
studied and reported. 


Admission to College 


Although the great majority of disadvantaged students now in college 
attend institutions of little or no selectivity (Coleman et al., 1966; Kend- 
rick, 1967), it is perhaps not surprising that a disproportionate amount 
of research related to transition can be found in the literature on selection 
of students by colleges. If students are not in college, someone must be 
keeping them out. Even the most routine operation of admissions requires 
or strongly encourages the local production of correlational analyses and 
other studies using quantitative methods. These can be published and 
are, and they inflate the literature. It is usually assumed that college 
admissions is a special case of the simpler kinds of industrial selection— 
that is, the problem is defined by a job definition and the available labor 
pool. This makes for apparent ease in the conduct of research. Thresher 
(1966) provided a thorough critique of existing admissions systems and 
forecast radical changes in admissions as part of radical changes in the 
role, function, and nature of higher education. But most of the research 
that can now be reported considers the college as fixed, and the student 
as variable. 

The trend in the selection of researchable problems has followed the 
general pattern found in the broader area of transition from school to 
college for the student population at large: an extensive emphasis on the 
identification and selection of students for admissions to college and less 
on institutional characteristics. Moreover, compensatory program elements 
and their complex interactions with institutional factors in relation to 
student variables have not been adequately considered (Kurland, 1967). 


Admissions Credentials 


“Admissions credentials” refer to all those factors considered as im- 
portant indices of probable success in college. Such factors as previous 
academic scholarship (usually defined in terms of high school grades and 
rank), aptitude, recommendations from relevant authority figures (parents, 
teachers, community leaders, etc.), personal interviews, biographical pro- 
files, and non-intellective personality factors are examples of “credentials” 
which colleges may plausibly use in making admissions decisions. 


Identification and Selection 


Research on the identification and selection of disadvantaged students 
has centered primarily on the examination of the validity of two traditional 
predictors of college success: high school scholarship and preadmissions 
test scores. Numerous studies and reviews of the literature have produced 
a number of generalizations that have been challenged recently in regard 
to this sample of the general student population. Generally high-school 
scholarship (i.e., over-all average or rank in the graduating class) has 
been found to be the best single predictor of college success (e.g., ACT 
Program, 1965; Beatley, 1922; Garrett, 1949; Richards and Lutz, 1968; 
Tribilcock, 1938). High-school scholarship correlates higher with first- 
year freshman average than with any lesser or greater amount of the 
college record (Garrett, 1949). Women have been found to be more pre- 
dictable academically than men (Munday, 1967; Seashore, 1962; Stanley, 
1967). The efficiency of prediction has not been substantially increased 
beyond that obtained through the optimal weighting of a single aptitude 
or scholastic ability test consisting of one or two scores and high school 
grade point average (Carlson and Milstein, 1958; Cochran and Davis, 
1950; Danskin and Hoyt, 1960; Davis, 1965; Dwyer et al., 1940; Gladfelter, 
1936; Henderson and Masten, 1959; Hills, 1965; Schmitz, 1937; 
Webb and McCall, 1953). Surveys of the literature by Cosland (1953), 
Durflinger (1943), Garrett (1949), Giusti (1964), Harris (1940), The 
Joint Committee on School College Relations of the American Association 
of Collegiate Registrars and Admissions Officers and the National 
Association of Secondary School Principals (1962), Segel (1934), Travers 
(1949), and Wagner (1934) give ample evidence of research findings 
obtained during the past thirty years which have contributed to the above 
generalizations. 


As the press for post-secondary education became more intense among 
black and other minority groups, and as it became increasingly apparent 
that predominantly white institutions were systematically excluding such 
groups from their campuses—for instance, less than 2% of the general 
student population at primarily white colleges are black (Kurland, 1967) 
—as a consequence of their admissions policies, a question about the 
adequacy of traditional measures of probable collegiate success has come 
to the forefront (Gordon, 1965; Kendrick, 1965; Society for the 
Psychological Study of Social Issues, 1964). More specifically, serious questions 
about the predictive validity of such indices as high school scholarship 
and test scores for students whose latent talent has not been previously 
realized have been posed by many researchers (Brown and Russell, 1964; 
Cameron, 1968; Clark and Plotkin, 1963; Fishman and Pasanella, 1960; 
Fishman et al., 1964). 


Such probing inquiries have motivated other researchers to investi- 
gate the differential predictability of various instrumental assessments and 
high school scholarship on college success for non-white (primarily black) 
students. Studies conducted by Boney (1966); Hills, Klock and Lewis 
(1963); Roberts (1962); and Stanley and Porter (1967) give evidence that 
the Scholastic Aptitude Test (SAT) of the College Entrance Examination 
Board is as valid for predicting grades of students in predominantly black 
colleges as for predicting the college grades of white students. Further, 
when SAT scores were used in combination with school rank, similar 
predictive validities have been found between black and white students 
(Olsen, 1957; Roberts, 1964). The possible bias of the SAT in predicting 
college grades of black students at integrated colleges was investigated 
by Cleary (1968). She concluded that there were no significant differences 
in prediction for black and white students from the two Eastern colleges 
selected for the study. Although there was a difference in the regression 
lines for black and white students at a third college (located in the 
Southwest), it was a matter of predicting black students’ college grades 
too high by the use of the white or common regression lines. Morgan (1968) 
discussed the utility of the SAT-Mathematics score for identifying 
“calculated risk” students. Munday (1965) found that the American College 
Testing Program (ACTP) battery was as useful for predicting the grades of 
socially disadvantaged students as it was for predicting the grades for 
other students. 
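
The kind of comparison described above, in which one group's grades are predicted from another group's regression line or from a common line, can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not a reconstruction of any cited study: the test scores and grade averages are synthetic numbers invented for the purpose, and the names `fit_line`, `mean_residual`, `group_a`, and `group_b` are hypothetical. A negative mean residual means that the line, on average, predicts grades higher than those actually earned.

```python
def fit_line(scores, grades):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(grades) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, grades))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residual(scores, grades, line):
    """Average of (actual - predicted); negative values mean overprediction."""
    slope, intercept = line
    residuals = [y - (slope * x + intercept) for x, y in zip(scores, grades)]
    return sum(residuals) / len(residuals)


# Synthetic data, invented for illustration only (not from any cited study).
# Both groups share the same slope, but group_b earns lower grades at any
# given test score, so the two regression lines differ in intercept.
group_a = ([400, 450, 500, 550, 600], [2.00, 2.25, 2.50, 2.75, 3.00])
group_b = ([350, 400, 450, 500, 550], [1.45, 1.70, 1.95, 2.20, 2.45])

line_a = fit_line(*group_a)
line_b = fit_line(*group_b)
common_line = fit_line(group_a[0] + group_b[0], group_a[1] + group_b[1])

print("group_b under its own line:   ", round(mean_residual(*group_b, line_b), 3))
print("group_b under group_a's line: ", round(mean_residual(*group_b, line_a), 3))
print("group_b under the common line:", round(mean_residual(*group_b, common_line), 3))
```

With these made-up numbers, group_b's grades are overpredicted both by group_a's line and by the common line, which is the pattern the text reports for the third college; whether and how strongly this occurs in real data depends entirely on the data.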

A few studies produced some evidence that perhaps the relative 
utility of high school grades as predictors of college success for students 
from socially and economically excluded ethnic groups should be reap- 
praised (e.g., Thomas and Stanley, 1969). Munday (1965) employed five 
separate criteria (college English average, college social studies average, 
college mathematics average, college science average, and over-all college 
average) and found the multiple R derived from optimally weighting four 
high school grades in each category was lower than the multiple R de- 
rived from the optimal weighting of the four ACT tests. McKelpin (1965) 


average grades for entering freshmen than high school grades did with 
the same criterion at a predominantly black college in Durham, North 
Carolina. No substantial differences in the predictive validities of the 
two preadmissions indices were noted in the case of black female students. 
Re-examination of Cleary’s data (1968), mentioned earlier, revealed that 
for blacks in one of the integrated colleges SAT-V and SAT-M correlated 
higher with college grade point average than did high school rank. Such 
relative superiority of test scores over high school grades was noted in the 
data provided in studies by Funches (1967), Perlberg (1967), and Peter- 
son (1968). 

Investigators in two studies have concerned themselves with the 
manner in which the measure of high school scholarship is derived for 
purposes of making decisions regarding admissions—a matter of real im- 
portance in the recruitment of disadvantaged students. Cramer and Herr 
(1967) concluded that ranking the entire high school class as opposed 
to ranking in a college bound-non-college bound dichotomy is as efficient 
for predictive purposes as any other ranking procedure. Gelso and Klock 
(1967) found that the total high school average in contrast to the high 
school academic average is important—especially for those colleges using 
a high school average cutoff point. They concluded that the total average 
will more often than not exceed the cutoff point and thus allow a greater 
percentage of student admissions. 

Substantial increases in the predictive validity of high school grades 
that have been adjusted for inter-secondary school variability of grading 
practices have been reported by Bloom and Peters (1961). However, the 
value of such efforts has been questioned (Hills, 1965). Lindquist (1963) 
observed that scaled grades rather than unscaled high school grades could 
only reduce the variability in predicting college grades by less than 2%. 


Biographical Data as Predictors 


Although not usually focused on any specific sample of the general 
student population, results of studies which investigated the value of 
biographical data as predictors of college success may suggest their use 
with disadvantaged students. Although the value of such data as predictors 
varies according to type of information gathered and the nature of the in- 
stitution from which criterion college averages are obtained, biographical 
data were found to make some contribution to the accuracy of prediction 
(Hills, 1965). Willingham (1964) found encouraging results when he 
analyzed the application-blank data of his student population rather than 
relying on separate questionnaires. Brown and Dubois (1964) found cer- 
tain biographical data of value in predicting college grades for high ability 
engineering students at Iowa State University. When non-intellectual 
factors were emphasized in the criterion of college success, a biographical 
inventory was found to provide consistently higher predictive validity 
coefficients than those obtained with SAT-Verbal and SAT-Mathematical 
scores (Anastasi, Meade and Schneiders, 1960). However, in other studies 
the investigators were unable to obtain any measurable predictive validity 
from biographical data (Garcia and Whigham, 1958; Webb, 1960; Hilton 
and Myers, 1967). 


Non-intellective Correlates of Academic Achievement 


The quest for non-intellective correlates of college success for college 
aspirants in general (Cramer and Stevic, 1968) and the disadvantaged 
student in particular has been discouraging. Studies by Donnan (1968), 
Frederiksen and Gilbert (1960), Frederiksen and Melville (1954), Holland 
(1960), Hoyt and Norman (1954), Richards, Holland and Lutz (1967), 
Spencer and Stallings (1968), Stecker and Voigt (1968), Watson (1967), 
and others which seemed to imply that non-intellective factors may be 
useful and that predictability may vary systematically with the nature of 
the student groups for which the r’s are computed (Ghiselli, 1960a, 1960b;
Munday, 1968) have motivated a great deal of concern about their use- 
fulness for predicting the college success of students whose academic creden- 
tials are questionable. An extensive survey of the literature and valuable 
insight into this area, as well as the entire problem of predicting collegiate 
success, were provided by Lavin (1965) and Stein (1963). 


Identification of Non-Achievers 


One of the implications of many studies in which non-intellective 
factors of college students were investigated is: if enough valid traits can 
be identified which differentiate students with academic problems from 
those who are successful, then the identification of successful versus un-
successful students along with the accompanying prescription of appropriate
collegiate experiences is possible. Such studies generally adhered to a
cross-sectional approach whereby achievers and non-achievers in college
were psychometrically compared during the freshman year. Differences
on personality variables between those who persist and those who with- 
draw were obtained, for example, by use of the Minnesota Multiphasic 
Personality Inventory (Drasgow and McKenzie, 1958; Vaughan, 1967) 
and the Gough Adjective Check List (Heilbrun, 1962). Although there is 
a general consensus about the existence of such differences, there is no 
agreement regarding their specific nature (Hill, 1966). Students from
low-income families who enrolled in work-study programs were found
to perform as well as or slightly better academically than a matched group
of more affluent students not in a work-study program (Bradfield, 1967).
Katz (1968), reporting on his study of academic motivation, suggested 
that among low achievers, greater self-criticism and less favorable self-
evaluation existed and that these factors tended to be generalized as ac-
quired or secondary reinforcement for the reduction of anxiety levels.
Engle, Davis and Muzer (1968), investigating the influence of peer group 
acceptance on student behavior, supported the idea that acceptance by a 
peer model can have a positive effect on the scholastic performance of 
underachievers. 

Vaughan (1968a) suggested that there are cognitive and personality 
factors that differentiate students who are dismissed for academic reasons 
from those who withdraw voluntarily. Other studies which sought to 
identify students who are likely to be non-persisters in either a given 
curriculum or a particular college were conducted by Marks (1967), 
Vaughan (1968b), Chase (1968), Faunce (1967; 1968), and Demos 
(1968). 


Nominations vs. Test Performance 


The use of community or school nominations as opposed to test 
scores as credentials for admissions to college was suggested by many 
but employed by few educators and other college personnel who are 
working in the area of education for the disadvantaged. Although it 
involves the “cream of the crop” of black students, and hence is not ad- 
dressed to the urgent problem of the higher education of the academically 
disadvantaged youth, a recent study gives some insight into the compara- 
tive value of these two methods. Blumenfeld (1969) conducted a study 
to determine what differences may exist in the kinds of students identified 
by two different methods used in the National Achievement Scholarship 
Program (NASP) of the National Merit Scholarship Corporation, a pro- 
gram designed to recruit highly talented black youths into higher educa- 
tion. On the basis of their avenue of entrance and/or their test performance 
on the National Merit Scholarship Qualifying Test (NMSQT) (Science 
Research Associates, 1966), subjects were classified into four groups:

Group 1: Students selected through high school nominations only;

Group 2: Students selected through high school nominations and who 
also took the NMSQT but scored below the arbitrary cut- 
off point; 

Group 3: Students selected through high school nominations and who 
also took the NMSQT and scored above the cutoff point; 
and 

Group 4: Students who took the NMSQT only and who scored above 
the cutoff point. 
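
The four-way grouping above amounts to a decision rule on two facts about each student: whether he was nominated by his high school, and how he scored (if at all) on the NMSQT relative to the cutoff. The following is only an illustrative sketch of that rule. The function name, the argument names, and the treatment of a score exactly equal to the cutoff (counted here as "above") are assumptions made for the example and are not taken from Blumenfeld's study.

```python
from typing import Optional


def classify_nasp_group(
    nominated: bool,
    nmsqt_score: Optional[float],
    cutoff: float,
) -> Optional[int]:
    """Return the group number (1-4) described in the text, or None.

    nominated   -- True if the student was nominated by a high school
    nmsqt_score -- the student's NMSQT score, or None if no test was taken
    cutoff      -- the qualifying cutoff (hypothetical value supplied by caller)
    """
    took_test = nmsqt_score is not None
    # Assumption: a score equal to the cutoff is treated as "above" it.
    above_cutoff = took_test and nmsqt_score >= cutoff

    if nominated:
        if not took_test:
            return 1  # nomination only
        return 3 if above_cutoff else 2  # nominated and tested
    # Not nominated: only high scorers entered the competition (Group 4).
    return 4 if above_cutoff else None


# Usage with a made-up cutoff of 100:
if __name__ == "__main__":
    print(classify_nasp_group(True, None, 100.0))
    print(classify_nasp_group(False, 112.0, 100.0))
```

A student who was neither nominated nor scored above the cutoff falls outside all four groups, which the sketch signals by returning None.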


Using indices of breadth of coverage and rate of success in the com- 
petition as the criteria for judging the relative efficiency of each method, 
Blumenfeld (1969) examined several student and school characteristics 
of the groups of students, particularly the nomination-only vs. test-score-
only groups. He concluded that the typical student who entered com-
petition solely through test scores (Group 4): (a) made a higher score
on a test given during the second stage of the screening process; (b) 
ranked lower in class; (c) had a lower grade-point average; (d) aspired 
to a higher academic level; (e) came from a larger high school class; 
(f) had parents with higher educational attainment; (g) had more friends
planning to attend college; (h) had resided in his community for a shorter 
period; and (i) was younger than the typical student who entered solely 
through high school nominations (Group 1). Furthermore, the members 
of Group 4 were more likely to come from a high school which: (a) had 
a lower proportion of blacks; (b) had more books in the library; (c) was 
less apt to be a public school; (d) more often had a chapter of the National
Honor Society; and (e) more often had a Dean’s List. Blumenfeld sug-
gested that the reason Group 4 students (high test scores only) made
lower high school grades and rank than did Group 1 students (nomina-
tions only) was that they were possibly in superior schools where they
were not as visible to the nominators as were their counterparts in rela-
tively less competitive schools where it takes less ability to stand out
academically. Moreover, he observed that had the nomination
method been employed exclusively, Group 4 students would have been
overlooked.

In conclusion, Blumenfeld noted that the two procedures yielded 
roughly the same proportion of successful participants, although the 
test score-only method resulted in slightly more. The nominated group 
tended to have a broader range of personal characteristics and to come
from a greater variety of schools. The nominated group exhibited a less
favorable socioeconomic and educational background. In terms of
criteria of success, Blumenfeld noted that a more accurate evaluation of
the two methods will be available once these students’ performance in
college is known.


Summary


In general, studies of the prediction of college success, identification 
and selection of college entrants have only dealt peripherally with the 
problems of ethnic minority group students who, with considerable prob- 
ability, would be unable to compete successfully in the regular college 
program without additional assistance. Investigators of most studies have 
concerned themselves with the problem in the broader sense. Moreover, 


other area of educational research on the transition from school to college. 
However, the implications of such studies give some direction as to what
approaches are feasible for disadvantaged students. 


Collegiate Programs for Disadvantaged Students 


Gordon and Wilkerson (1966, p. 134) defined compensatory programs
and practices: “A continuing activity by an institution that helps disad-
vantaged students who could not otherwise do so to enroll and progress
in college is . . . termed a compensatory practice. . . . An organized group
of related activities to the same end is . . . termed a compensatory pro-
gram. . . .” Modified admissions policies, financial assistance, and tutorial
service are examples of the former and special instructional programs 
would be an example of the latter. “Remediation” implies that an institu- 
tion is attempting to get a student from where he is to where he wants 
to be; it conveys the image of providing students with a second chance 
(Roueche, 1968).

Various types of programs and practices have been used to assist 
students in their efforts to successfully complete a college education. 
Summer-preadmissions programs, reduced course load, remedial courses, 
tutorial assistance, guidance and counseling, extended length of time to 
meet graduation requirements, and financial assistance are but a few of 
the elements that have been employed either singly or in combination to 
meet the needs of disadvantaged students. Gordon (1967) noted that 
although the practice of offering non-credit remedial courses—mainly in 
English but also in mathematics—is still widespread, it appears to be 
losing ground. A substantial number of institutions were found to have 
ceased this practice for a variety of reasons. A major cause of discon- 
tinuation, it was observed, is the paucity of evidence that these courses 
improve academic performance. In examining the extensive use of various 
compensatory practices, Gordon and Wilkerson (1966) noted that practices 
designed to assist disadvantaged students after they have entered college 
predominated among the institutions reporting in the spring of 1964. 

Almost two thirds (62%) of the frequencies were accounted for by 
counseling, credit and non-credit courses, instruction in study skills,
tutoring, special curricula, and extended time for completing degree re-
quirements. A recent survey by Nash (1969) found extra counseling, tutorial
assistance, modified admissions standards, and financial aid to be the most
frequently mentioned practices among 211 senior colleges offering assistance
to disadvantaged students. 

Remedial or developmental programs (particularly remedial English) 
were found to be prevalent at the community college level (Roueche, 1968). 
Berg and Axtell (1968) reported that a series or block of remedial courses 
appeared to be the most numerous type in their study of California com- 
munity junior colleges. In general, the major objective of these programs 
is to increase the competence of disadvantaged students in communications 
and numerical skills. Other provisions noted by Berg and Axtell were 
part-time employment, free transportation, free lunches, legal assistance, 
central student meeting facilities, special recruiting, and special summer 
orientation. However, the national survey of junior colleges by Schenz 
(1963) revealed that only 20% of these institutions designed special cur- 
ricula for disadvantaged students. Community junior colleges apparently 
are attempting to meet the needs of low-achieving students by providing 
remedial courses which are typically available to all students (Roueche, 
1968). 

At the community college level, several surveys were conducted to 
determine how widespread were the various supportive programs and prac- 
tices to assist disadvantaged students. The Curriculum Commission of the 
American Association of Junior Colleges authorized a national investiga- 
tion of the practices which junior colleges followed regarding the cur- 
riculum offerings for students with low ability. In a subsequent study, 
Schenz (1964) surveyed the views of junior college administrators regarding 
possible solutions. The conclusions of the study revealed that 91% of 
the junior colleges followed an “open door” policy for all high school 
graduates and for all persons eighteen and over who could profit from 
the instruction. Martyn (1966) confirmed this finding in his investigation 
of California junior colleges. Examining enrollment data and patterns 
of student persistence in community colleges, Jaffe and Adams (1969) 
noted that some serious questions could be raised with regard to the 
impact of the “junior college movement” on the later vocational oppor-
tunities of racial minority youth. For example, they observed that roughly
one in three of two-year college entrants transfer to senior college, another
third complete vocational programs, and the remainder simply drop out
of school. Roueche (1968) cited an unpublished study by Paul Lawrence 
(1966) which indicated that only 13 or 14 junior college districts were 
making any substantial effort to attend to the educational needs of dis- 
advantaged college students in the state of California. Berg and Axtell 
(1968) found that the majority of the institutions in their study (53.4%)
attempted to meet the educational needs of the disadvantaged through the
regular instructional program while slightly more than a fourth (27.4%)
had implemented special instructional programs. More institutions (39.2%)
reported they were attempting to meet the needs of disadvantaged stu-
dents through the regular counseling service than through additional
individual counseling (33.8%) or additional group counseling (0.0%).

The pattern of compensatory education in the senior colleges is
similar to that found in community junior colleges. Gordon and Wilkerson 
(1966, p. 155) concluded that most college level compensatory programs 
and practices “. . . seem to fit the somewhat dreary pattern of remedial 
courses which have plagued many generations of low-achieving students 
with but little benefit to most of them.” Moreover, it was noted that 
although an increasing number of institutions of higher education are 
attempting through a variety of approaches to help socially disadvantaged 
youths, proportionately very few of the nation’s colleges and universities
have thus far begun to develop compensatory programs and practices, and
most of those that have are serving very small numbers of disadvantaged
students. It appears that few institutions are recruiting disadvantaged
students with serious academic deficiencies. Most colleges and universities 
are addressing themselves to youths that have limited financial resources 
(Williams, 1969). In their survey of two- and four-year institutions, 
Gordon and Wilkerson (1966) found that 36.5% of the responding insti- 
tutions indicated they were conducting a variety of compensatory programs 
and practices. Three hundred eighty-six institutions (63.3%) reported 
they were not conducting any compensatory practices. The pattern of re- 
ports of “some” involvement was fairly consistent for the nine geographical 
regions surveyed. When the distribution of institutions was examined by 
control, it was revealed that only those institutions under city or district
control reported a greater proportion of “some” compensatory involvement
(57%) than those reporting “no” involvement (43%). State, private, and
religious institutions were fairly consistent in reporting “some” or “no”
involvement in collegiate compensatory practices: each had a greater
proportion reporting “no” practices (range = 63-69%) than “some” com-
pensatory practices (range = 31-37%). A recent national survey tended
to confirm the findings in the Gordon and Wilkerson survey (Thomas, 1969).

A definite relationship appears to exist between the types of dominant 
curricula offered at the various institutions and the presence of com-
pensatory practices. It was found that roughly 90% of the colleges re-
porting compensatory practices were liberal arts institutions (Gordon and
Wilkerson, 1966). 

Of the 156 colleges and universities providing descriptions of their 
efforts to assist disadvantaged students to the Middle States Association
of Colleges and Secondary Schools (1968), Thomas (1969) estimated
that 42.5% were “uncommitted” in terms of the extensiveness of institu- 
tional provisions available to disadvantaged students. Institutions reporting 
“some overtures” at addressing themselves to the problems of disadvantaged 
students were estimated to be only 18.9% of the total number of institu- 
tions reporting. This would appear to parallel the conclusion drawn by 
Egerton (1968) that less than 11.5% of the 162 institutions responding to 
his survey were initiating programs of substantial impact upon their re- 
spective resources and skills. Moreover, he observed that the major debate 
often centered on whether institutions of higher education should become 
engaged in activities for the disadvantaged rather than on how to proceed 
with this challenge. 


Effectiveness of Programmatic Efforts


Roueche (1967) noted the ineffectiveness of remedial programs at 
the community junior college level. He advanced the premise that the 
ineffectiveness of such programs is due in large measure to uncertainty 
about what the programs’ basic goals should be. In regard to community 
junior college admissions procedures Roueche (1968, p. 3) noted that: “The 
open-door concept is valid only if students are able to succeed in their 
educational endeavors. Currently, the only tenable value seems to be that 
enrollment allows a student to say, years after his short tenure, ‘I went 
to college.’ But except for this inestimable benefit, little else is apparent.” 
For example, Bossone (1966) found that from 40 to 60% of the students
enrolled in remedial English classes in California public community junior 
colleges earned a grade of D or F. Only 20% of the students enrolled in 
these remedial courses later enrolled in college credit courses. Other studies
have reached similar conclusions at the community junior college level
(Gold, 1965; Schenz, 1963; Thelen, 1966). Berg and Axtell (1968) noted 
that most of the programs for disadvantaged students in California com- 
munity junior colleges are based upon or include various characteristics 
identified by Coleman et al. (1966) as being ineffective with respect to 
student achievement. Goldberg, Passow and Justman (1966) concluded 
that ability grouping was neutral and its value was contingent upon the 
extent to which specific instructional techniques based upon the needs of 
the students were used. Meister and Tauber (1965) reported lower than 
usual attrition rates among those students who were provided special 
services. 

Studies at the senior college and university level have been particularly 
limited in regard to effectiveness and impact of compensatory programs 
and practices in ameliorating the academic deficiencies of disadvantaged
students. Alexakos and Rothney (1967) observed that students under-
going a high school guidance laboratory experience tended to perform
better in college than a matched group who did not receive such an ex-
perience. Meister et al. (1962) reported that the program Operation
Second Chance has produced a reversal of the trend toward academic 
failures. Froe (1966) reported on the innovative and flexible three-track 
program developed at Morgan State College. For upgrading the quality 
of remedial programs, Theresa Love (1966) described four approaches 
that were developed to stimulate language development in youths with 
linguistic handicaps. When attrition rate has been considered as a cri- 
terion of success, it has been noted that in most instances, disadvantaged 
students’ holding rate was not measurably different from that of regular 
students (Williams, 1969). However, very little systematic research has 
been conducted to determine whether the comparable retention rate for 
disadvantaged students is a function of innovations in compensatory pro-
grams or other factors, i.e., selection of less competitive courses, lighter
course loads, atypical persistence patterns. 


Summary 


Research on the extensiveness and effectiveness of compensatory pro- 
grams and practices has been limited in quantity and scope. Yet, even 
with the paucity of evaluative studies, it is safe to note that evidence 
points to the conclusion that existing compensatory programs and prac- 
tices have made little impact in eradicating the problems of disadvantaged 
college students, nor have the majority of colleges accepted this area as 
their role. 


Conclusion 


Selection of research for this chapter was, of course, arbitrary though 
not capricious. In present usage, the “disadvantaged” are arbitrarily de- 
fined. They do not, in fact, have many educational problems which are 
basically different from those which have always been present in the 
schools. We could simply define the disadvantaged as all those having
difficulty in school, but we are mainly concerned here with the pressures 
exerted on the entire social system by the rising expectations of formerly 
excluded groups. 

We have excluded or skimped some areas which might reasonably
have been included, and these need to be mentioned. In the past decade,
financial aid to college students has become a complex technical area with
a body of specialists and a literature of its own. The predominantly
Negro colleges have received more scholar! 


gested or than some 0! 
public higher education and the appearance of state master plans a 


have more effect on disadvantaged students than any present or likely
future idea for compensatory programs. The development of computer-
assisted guidance and instructional systems may (or may not) be a crucial 
development of the 1970’s. We have said little or nothing of ethnic 
studies, curricula, community involvement, the movements for black (or 
brown) self-determination and the other new developments which seem 
likely to make current discussion of compensatory programs, guidance, and 
selection seem anachronistic before much time has passed. 


Some of these matters are not currently yielding research to be re- 
viewed. At the beginning of this chapter we noted that the current crisis 
in values and fundamental social arrangements is raising questions not 
merely about the adequacy of current research but about the possibility 
that research itself does not enjoy the automatic esteem that was once 
accorded it. In part this is a matter of distrust by minority groups who 
have in the past been studied to the advantage of everyone but them-
selves. In part it is a matter of organization. Reports now are often
addressed to questions that are no longer interesting. Basic descriptive 
data are needed. Researchers do not know what is happening to young 
people in school or after school. In most states and cities it is unknown 
how many black students, or Mexican-Americans, or Puerto Ricans are 
graduating from high school and what advice they are getting, from 
whom, or to what effect. 

Much formal research—the kind that can be reviewed—is, too, ato- 
mistic. It is doubtful that any isolated educational “practice” has ever 
been validated or invalidated by formal research. Students, especially 
poor students, have or lack access to higher education within a metro- 
politan area or state system of high schools and colleges. These systems, 
which include private institutions and voluntary agencies, have organic 
relationships which need to be studied. 


Much research is too slow, as are the methods of dissemination of 
research results. The disadvantaged are extraordinarily vulnerable to 
legislative or administrative decisions made in any one of numerous places. 
Since educational research is overwhelmingly applied research, it too is 
affected by such decisions. Traditional methods of research dissemination,
including this Review, increasingly are primarily historical records. 

At the same time, the gap between practice and basic research is, 
as always, painfully wide. We have reported here the growing conviction 
among serious investigators that the improvement of higher and lower 
education for the disadvantaged depends upon better definition and mea- 
surement of “talents” and better relations between these measurements 
and instructional programs. But it is difficult to believe that what is
“known” now about talent and instruction is known to many of those
who directly affect disadvantaged youth, or that our institutions are or-
ganized to take advantage of what has been known for a long time.

Finally, action in the educational system is being taken, as it must
be, without any regard for the research schedules of individual research 
scholars or agencies. Some policy centers are informed by whatever re-
search is within reach; some are not. But neither the disadvantaged young
people who intend to go from school to college nor the school and college 
officers involved in the process are prepared to wait upon conventional 
research procedures. It will increasingly be the case that much of the 
research that really affects the transition from school to college will appear 
in the literature long after its principal applications have been made. 

None of this argues that research is useless. To the contrary, the
development of effective research facilities for education is just begin- 
ning. But it does argue that better methods than researchers now have 
for keeping one another informed and for reporting what has been done 
are needed.


FOREWORD


Growth over the 
past decade in educational evaluation theory, research and practice is appar- 
ent. I regard much of the current growth as proliferative with some long 
term promise for structural and functional differentiation. This review does 
encourage evaluators and their clients that the field may be progressing 
beyond the stage of creating grand plans. 


This issue of the Review was intended to be a comprehensive critique 
of the growth of educational evaluation theory, research and practice. It 
fails to deliver what the Issue Editor promised the Review Editor more than 
a year ago. We report little about the historical growth of evaluation 
methodologies, less about the models currently competing for selection and 
use, and nothing about educational product evaluation. And what of the 
expertise and planning needed, the procedures required, the probable 
outcomes and costs of conducting evaluation studies? There is much work 
to be done. 


Unfortunately, other educational evaluators also promise more than 
they can deliver (accountability, wise decisions, summary indices of worth) 
and ladle spoons full (or shovels full) of rhetoric to hungry clients. Such 
behavior cannot be tolerated if the field is to survive—much less develop. 
Fortunately, this issue does offer critical reviews of some of the important 
work we have at hand. Several chapter authors go beyond their charge to 
describe and assess the pertinent research literature and present their 
formulation of the work needed to assure real development in educational 
evaluation. Their contributions may be the growing edge of the field. 


Terry Denny 
Issue Editor


1: OBJECTIVES, PRIORITIES,
AND OTHER JUDGMENT DATA 


ROBERT E. STAKE 


University of Illinois 


This review is embedded in a plea. It is a 
plea to treat educational objectives as data. Fallible data. The review 
itself surveys methods for those particular data that reflect judgment of 
what education should accomplish. 

Evaluation requires judgment. Decision-making requires judgment. 
Both are judgmental in themselves but also depend on judgments previously 
made. A school and a curriculum are where they are because of judgments 
from within and from without. Judgments are made early, and late, and 
in between times. To understand what a school is doing requires an 
understanding of what a school is expected to do. 

In education, as elsewhere, judgments will continue to rest on incom- 
plete knowledge, imprecise measurements, and inadequate experience. No 
error-free system is possible, but improvements are within easy reach. The
evaluator may lessen the arbitrariness of judging and decision-making by 
introducing data-gathering methods already developed by other social 
scientists. Social psychologists, behavioral scientists, economists, political 
scientists, and historians routinely study opinions, preferences, and values. 
Many of their methods can be used to measure the judgments that shape 
an educational program. 


Judgment Data 


In this section several different kinds of judgment data are identified. 
Emphasized here are judgments of what educators should do rather than 
judgments of what educators have done. Some contemporary designers 
of evaluation studies call for the collection of judgment data; others do
not. Their works are cited.


*Leonard Cahen, Educational Testing Service, and Dennis Gooler, University of Illinois,
served as consultants to Dr. Stake on the preparation of this chapter.


What Educators Should Do


Personal value-commitments, educational aims, goals, objectives,
priorities, perceived norms, and standards—in one form of expression or
another—are judgment data. An analysis of these data should reveal 
what someone wants school people to do. Many data-messages are only 
indirectly informative but still valuable. When a teacher says, “I believe
a child should learn to be responsible to himself for what he does during
study period”; when a parent says, “I want my son to go to college”; when
a letter-to-the-editor writer says, “Kids can’t be trusted”; or when a high 
official says, “Reading deficiencies are a national disgrace,” judgment 
has been passed. Sometimes the subjective character of the utterance is 
obvious; sometimes the speaker is unaware that he has expressed a value. 
But in either case he is making a contribution to a value background that 
is, in fact, a definition of success and failure. 


Success does not mean hitting a bull’s-eye; success means coming
acceptably close to a valued target. The responsibility of the evaluator is 
not only to find a good target-test and to tag the discrepant shots; he 
should also learn what accuracy is appropriate. He should learn which 
people hold the goal in high regard and which do not. But often an 
evaluator reports gain-score data with decimal precision and no data at all 
on the suitability of the instructional goals. 


Compared to many educational-research data, judgment data are
messy. They seem particularly susceptible to the obtrusiveness of formal
evaluation. But even in natural conversation they are half shrouded,
ambiguous, and imbued with emotion. Preferences are often inconsistent
and arguments are circular. Whatever the clarity or confusion, that clarity
or confusion should be acknowledged. Whatever the conflict and diversity,
that too should be acknowledged. Whether people should think more 
rationally is not the issue since it is not the evaluator’s task to reform 
human-judgment processes. The issue here is whether evaluators should 
treat these judgments as relevant data. 


Many of those who write about educational processes take objectives 
as a starting point. Objectives are high-value targets. Objectives presumably 
identify the outcomes that someone thinks are most worthy, and all the
unmentioned outcomes presumably are less worthy. Listing objectives can 
be thought of as selecting a few more-valued goals from a vast multitude 
of possible goals. The list is always an oversimplification. Goal-stating 
succeeds to whatever extent it succeeds because people are tolerant of 
omissions: particularly omitted superordinate goals and omitted statements
of conditions. With any statement of objectives there are assumptions about
more basic needs being met. In the classroom it is assumed that certain 
essential educational skills—sentence making and reading comprehension
and shutting up and refraining from bodily threat to the teacher, for
example—will be maintained. When these student skills falter, the teacher
is likely to abandon stated objectives to attend to the “essentials.” Also,
there are unstated but expected conditions for the instruction. As conditions 
change, objectives are modified. In other words, no set of stated objectives 
is a fixed and final list of what the teacher will or should attend to. The 
list does reveal what its authors judged to be worth special attention. 

A list of objectives is an expression of attitudes. It is a valuable, 
sometimes essential, component of evaluation. It can be obtained by many 
devices, including attitude scales. On any educational scene there will be 
more than one set of objectives. Different groups have different attitudes, 
different objectives. Objectives are judgment data better treated by the 
rules that govern mass subjective responses than by the honors bestowed 
upon “fundamental truths.” Objectives, like attitudes in all their subjectivity,
can be collected and scaled objectively. 

Sometimes it will be useful to compare objectives to generalized value 
commitments. Personal values are judgment data. In chapter 3 in this 
issue of the Review, Westbury indicates the relevance of value analysis to 
evaluation. Objectives can take on different meanings depending on the 
values behind them. A classroom emphasis on contemporary rock music 
may be rooted in a teacher’s desire to improve the social conscience or in 
a teacher’s desire to share aesthetic experiences. A workshop on behavior 
modification may be the creation of someone who seeks a more rational 
approach to instruction but also of another who sees an opportunity “to 
make a fast buck.” Two bases of values analysis suggest themselves to the 
evaluator: (1) a logical basis to check on the reasonableness of the selection 
of objectives, given a certain value position (see Jensen, 1950); and (2) 
an empirical basis to see how broadly (e.g., in the community, among
the faculty; see Larkins and Shaver, 1969) certain value-positions are held 
and how desirable certain objectives are perceived to be. Understanding 
value positions may be a shortcut to understanding educational objectives, 
or it may not. The evaluator needs to know what kinds of knowledge
his clientele and audiences can use. 

A knowledge of values should facilitate the specification of objectives. 
The evaluator should be acquainted with various schemas for categorizing 
values (Whitehead, 1929; Vernon and Allport, 1931; Oliver and Shaver, 
1966; Scriven, 1966). 

In Table I is an illustration of 16 value-laden broad educational 
objectives (Downey, 1960). Whether it would be useful to relate specific 
program objectives to these 16 broad objectives (or any other values 
typology) is something for the evaluator to consider when drawing up plans 
for a study.

TABLE I
DIMENSIONS OF THE TASK OF PUBLIC EDUCATION:
A CONCEPTUAL FRAMEWORK

A. Intellectual Dimensions

 1. Possession of Knowledge: A fund of information, concepts.
 2. Communication of Knowledge: Skill to acquire and transmit.
 3. Creation of Knowledge: Discrimination and imagination, a habit.
 4. Desire for Knowledge: A love for learning.

B. Social Dimensions

 5. Man to Man: Cooperation in day-to-day relations.
 6. Man to State: Civic rights and duties.
 7. Man to Country: Loyalty to one’s own country.
 8. Man to World: Inter-relationships of peoples.

C. Personal Dimensions

 9. Physical: Bodily health and development.
10. Emotional: Mental health and stability.
11. Ethical: Moral integrity.
12. Aesthetic: Cultural and leisure pursuits.

D. Productive Dimensions

13. Vocation-Selective: Information and guidance.
14. Vocation-Preparative: Training and placement.
15. Home and Family: Housekeeping, do-it-yourself, family
16. Consumer: Personal buying, selling, and investment.


The task of measuring values is difficult, Boulding (1957) said. Values, 
he claimed, are heterogeneous aggregates. But the elements must have some 
similarities or they would not be recognized as values. The evaluator’s task
is to reduce the apparent heterogeneity to a manageable representation, to
separate the things-people-want-most from the other things, in a simple
yet valid way. If in a given school situation a people’s wants can be gathered
into large groups of wants, one can label each group an educational value. 

Another form of judgment data is the priorities given to certain objectives.
The literal meaning of priority indicates “what comes first in the
sequence of events,” but the meaning here concerns relative importance. 
A list of objectives implies priorities; those expressed objectives have been
[...] than certain other objectives, a crude [...]
[...]ed that make finer gradations of im[...]
[...] what kind and amount of emphasis will [...]
[...] were unlimited resources or if all objectives [...]able, it would not
be so important to specify the priorities. In actuality, it is important
not only to choose the objectives to be pursued but to allocate scarce
resources to each of the several objectives. Of course, professional people
do both—but usually
intuitively. That these selections are not explicit and conscious in even 
the most prescriptive training programs is partly a reflection of the difficulty 
of stating “specific priorities” but mostly a reflection that people at all
levels of expertise make most decisions intuitively. It may be true that 
some things will always be done better intuitively. We researchers need 
to find out whether or not specific priorities aid the direction of school 
programs. It is my belief that excessive attention has been given to precise 
goal-specification and insufficient attention to statements of priorities. 


It would follow that the evaluator and the educator should have some 
procedures by which they can translate objectives into priorities when 
given an array of needs, a ledger of resources, and some knowledge about 
what the results of various instructional procedures are likely to be. Since 
there is no formula for deriving these priorities, the alternative is to find 
out what people want. What do staff, lay leaders, students—whoever 
are the important people—say should be the allotments of time, the alloca- 
tion of concern and incentive, and the extent of remediation in the face 
of failure? These priorities are also judgment data. Pheiffer (1968) 
introduced a few projects in which systems analysis has been used to 
quantify judgments and attach priority numbers to alternate objectives. 
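
One simple way to collect such allotment judgments can be sketched in Python. This is only an illustration under stated assumptions, not a procedure taken from the projects cited above: it supposes that each respondent distributes 100 points among a few objectives, and it averages those allotments within each group. The function name, the group labels, the objectives, and every number are hypothetical.

```python
from statistics import mean

def group_priorities(allocations):
    """Average each objective's points within each group and rank the objectives.

    `allocations` maps a group name to a list of respondent dicts; each dict
    maps an objective to the points (out of 100) that respondent gave it.
    """
    summary = {}
    for group, respondents in allocations.items():
        objectives = sorted({obj for r in respondents for obj in r})
        means = {obj: mean(r.get(obj, 0) for r in respondents) for obj in objectives}
        # Rank objectives from highest to lowest mean allotment.
        summary[group] = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return summary

# Hypothetical responses, invented for illustration only.
allocations = {
    "staff": [
        {"reading": 50, "citizenship": 30, "vocational": 20},
        {"reading": 40, "citizenship": 20, "vocational": 40},
    ],
    "students": [
        {"reading": 20, "citizenship": 20, "vocational": 60},
        {"reading": 30, "citizenship": 10, "vocational": 60},
    ],
}

priorities = group_priorities(allocations)
for group, ranked in priorities.items():
    print(group, ranked)
```

Reporting the ranking separately for each group, rather than pooling all respondents, keeps visible the conflict and diversity that the evaluator is asked to acknowledge.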

A fourth kind of judgment data deserves emphasis in an evaluation 
study: standards. As used here, the term standard means a desired level
or quality of something as cited by an authority. Standards answer the 
question “How much is good?” Standards are another form of objective: 
those seen by outside authority-figures who know little or nothing about 
the specific program being evaluated but whose advice is relevant to 
programs in many places. 

When a local educator sets up the equivalent of a standard (e.g.,
students should score at the national mean on a reading test, should type 
at 50 words per minute, or should stay out of jail during the course of 
study), that standard is usually called an objective. When an authority 
figure, unaware of the means by which the local teachers will pursue 
objectives or even of what the local objectives are, indicates a desired 
level of performance or a desired environment, it is more likely to be called
a standard. When President Nixon’s Committee on National Goals reports, 
it is likely to speak of criteria and standards—criteria to tell Americans 
what goals they should pursue and standards to prescribe a successful
pursuit. 

Standards can be specific or broad. Spokesmen for mathematics 
teachers from time to time specify minimum competencies for all young 
people, and spokesmen for the American Library Association express their 
ideas about exemplary library facilities for certain kinds of schools. These
are standards. Standards are another part of that background of values
needed for the definition of success and failure. They are data usually to 
be treated more as expert legal testimony or as historical documentation 
than as population statistics. Yet the difference between casual treatment 
and deliberate and orderly treatment of such judgment data may be the 
difference between a perfunctory description and a sound base for
decision-making. Taylor and De Corte (1969) examined the distinction between
norms and standards and summarized the literature on reaching empirical
definitions of standards.

Two obvious forms of judgment data remain to be mentioned. Many 
educational activities are undertaken to change student attitudes. Affective 
domain outcome data are judgment data. These will not be discussed 
directly in this chapter though many of the techniques cited are potentially 
useful. Krathwohl et al. (1964), Davis (1964), and Mayhew (1965) made 
major contributions toward the understanding of affective outcomes. 

The last form of judgment data is perhaps the most critical to good 
evaluation: the summative judgments people make about the overall 
program or some component of it. Many different viewpoints will be
gathered in some evaluation studies. The evaluator needs a tool kit of
techniques for gathering and presenting these data. Such techniques will
not be reviewed in this chapter, although ways of discovering what the
educator should do will often be appropriate for discovering the worth of
what the educator did do.

Additional References: Kluckhohn (1951); Churchman and Ratoosh (1959); 
Kuhn (1963); Educational Policies Commission (1961); Taylor (1970). 

Evaluation Theory and Procedure

[...] quality-control. Lortie (1967) perceived the [...]
and desirably pluralistic; he found little promise in rational and
prescriptive evaluation plans designed for relatively value-free data.


The evaluation literature is not the place to look for procedures for 
analysis of judgmental data. Metfessel and Michael (1967, p. 935) identified 
the following as one of eight major phases of an evaluation study.

7. Interpret the data in terms of certain judgmental standards
and values concerning what are considered desirable levels of 
performance on the totality of collated measures—the drawing 
of conclusions which furnish information about the direction 
of growth, the progress of students, and the effectiveness of the 
total program. 


But in a detailed appendix of data-gathering techniques they did not 
identify ways of gathering those judgmental data, standards, and values. 
In perhaps the most thorough large-scale evaluation study ever conducted, 
Smith and Tyler (1942) discussed the contributions of measurement data 
to guidance and administrative decisions; but they did not rely on standard- 
ized procedures to obtain and analyze judgment data. They accepted 
whatever objectives the educators gave them. Considering only the formal 
evaluation procedure, they chose to examine the Eight-Year Study objectives 
neither against other objectives, nor as formal transformations of values, 
nor in terms of the specific priorities given to them. A large majority of 
contemporary evaluation plans do likewise. Manuals and guidelines (Grob- 
man, 1968; Tyler, 1968; McIntyre et al., 1969; Southwest Educational 
Development Laboratory, 1969) for project evaluation typically call for 
gathering statements of objectives without reference to their value loadings. 
They require no formal attention to priorities or standards. Such de-emphasis 
of judgment data was challenged by Glass (1968). He objected to Bloom’s 
proposal to subsume evaluation methodology under contemporary testing 
methodology. Scriven’s criticism (1967) of Cronbach’s (1963) paper was 
another effort to avoid capture of evaluation methodology by the
psychometric research establishment and to emphasize the specific “goods” and
“bads” of classroom instruction. Scriven did not, unfortunately, say that
judgment data should be treated as the social psychologist would treat
“preferences” or as the economist would treat “utilities.” He did not
even say that the educator should receive an objective report as to what
values are placed on various things by various groups. He did say that
curriculum and instruction should be subjected to value analyses such as
the philosopher or historian might employ. Berlak, in chapter 4 of this 
Review, appears to agree with Scriven in preferring logical analysis to 
empirical. Stake (1967a) voted for empiricism, urging the development of 
a social-science-based technology to handle judgment data. 

For his taxonomy of educational evaluation designs, Worthen (1968) 
did not acknowledge that the extent to which an evaluation gathers and 
processes judgment data (as here defined) might be an important basis 
for differentiation among designs. This seems surprising since at the time 
he wrote the paper his academic adviser (Stufflebeam) had given as much 
attention to a formal plan for evaluating judgment data as any designer.
In Stufflebeam’s (1969) CIPP model,* the C stands for context, which
is to be interpreted partly as the context of values, i.e., the standards and 
objectives of the program to be evaluated. Stake (1967b) likewise empha- 
sized value alternatives, assigning much of the total evaluation data matrix 
to judgment data with categories called “rationale,” “intents,” “standards,” 
and “judgments.” 


Professional judgment—implicitly demeaned in most evaluation de- 
signs—was given more appropriate honor in the “accreditation” model 
of evaluation (see Glass, in press). The accrediting associations and related 
agencies have been admirably thorough in considering the many parts 
of an educational system, resisting pressures that would force complex 
dimensions of value into a unidimensional criterion of merit. But these
professional associations have been as indifferent as the Tylerian evaluators 
and the systems analysts to the need for objective methods for handling 
judgment data. The following statement taken from Evaluative Criteria 
(National Study of Secondary School Evaluation, 1969, p. 1:9) illustrates 
the personalistic standards solicited by this evaluation approach: 


The checklists and evaluations should be evaluated on the follow- 
ing four-point scale: 


4 Excellent 

3 Good 

2 Fair 

1 Poor or missing 
na Not applicable 


Question will frequently arise about the basis for comparison 
of points on the scale. The answer is extremely difficult to give. 
In any entity as complex as a school, it is not easy to describe in 
detail what excellent or poor really means in the hundreds of 
items for which evaluations are required. The best answer seems 
to be that the evaluator should draw upon his total experience
in schools and make the best judgment he can on the basis of
that experience.


Most enthusiastic advocates of “experiential” standards would agree that
steps can be taken to communicate more clearly what a particular rating
means by anchoring meaning in illustration and by sharing meaning
through programmatic training and use. The methods cited on the following
pages could improve this communication.


Additional References: Hyman and Wright (1967); Thomas (1968); Ameri- 
can Institutes for Research in the Behavioral Sciences (1969). 


*C = Context; I = Input; P = Process; P = Product

Methods of Gathering Judgment Data


In this section are reviewed some of the instruments and procedures 
available to evaluators for gathering data on objectives, values, priorities, 
and standards. Identified are three situations in which (from time to time) 
the evaluator works. In one, he solicits judgmental responses with a 
standardized protocol from a group of individuals and then aggregates 
these responses to describe the value commitments of that group. In another, 
he employs one or more “experts”—with or without a checklist or other 
structuring device—to make personal observations of events or processes 
and then to describe the value commitments apparent in them. In the third 
situation the evaluator employs experts to analyze documents—e.g., laws, 
curriculum guides, textbooks, or critical reviews—and to reconstitute the 
value commitments made in creating them. In any one study, of course, 
he might do all three. In this section also are citations of the technical 
means used by pollsters, observers, and analysts to gather judgment data. 


Instruments for Aggregate Data 


The most common procedure for getting judgment data about a 
group is to persuade members to indicate their individual viewpoints and 
to aggregate these in some way. The description of the whole is the aggregate 
descriptions of the parts. In this subsection are considered four ways of 
getting such data: (1) surveys, (2) scaling, (3) Q-technique, and (4) 
the semantic differential. These four techniques could be used in a pretest- 
posttest design to study attitude change as an outcome of an educational 
program; but the purpose of the discussion in this chapter is to consider 
background attitude status to probe the complex of values held by various 
groups. Regardless of whether attitude change is an objective, community 
and staff values influence teaching and learning. Those are the values 
that define the success of the program, and those are the values the reader 
should keep in mind as methods of data collection are identified. 

Surveys are undertaken to obtain categorical answers to specific 
questions from a particular group of people. They may involve personal 
interviews or paper-and-pencil questionnaires. To get acquainted with survey 
methods, the evaluator should read Hyman (1955), Stephan and McCarthy 
(1958), and Trow (1967, 1969). A list of currently used educational 
questionnaires is published periodically by Gleazor for the American Council 
on Education. Major evaluation studies should rely on professional assis- 
tance. It is available commercially from such agencies as Gallup Associates 
of Princeton, New Jersey, and the National Opinion Research Center of 
Chicago (see NORC, 1944). Professional assistance is also available on 
campuses which have survey-research offices. Directors of lesser studies
can develop or adapt their own procedures from those reported in the
literature. 

The educational-research literature describes few major surveys. The 
perspectives of education held by Catholics were surveyed by Greeley and 
Rossi (1966) and Neuwein (1966). Much earlier, Sandifer (1943) looked 
at lay perceptions of Progressive Education. Many non-generalizable studies 
of local circumstances are commissioned by school administrators. Most 
often these have an emphasis on economic characteristics; occasionally they 
attend to personal-value positions. (See Mort and Furno, 1960, and James, 
1963, for illustrative studies; see Furno, 1966, for a review.) 

Surveys are valuable when good data can be obtained by direct 
questioning. Often, however, ideas are vague, a single question is ambiguous, 
or meanings are personal and obscure. A more redundant and probing 
method is needed. When an objective or point of view is important enough 
to justify a more costly search and elusive enough to defy direct questioning, 
the evaluator may switch to rating scales. A display of many rating scales 
and a good introductory presentation were made by Shaw and Wright 
(1967). Guilford’s Psychometric Methods (1954) is useful for a review 
of such classical topics as pair-comparisons and the time-order error. 
More theoretical treatments of scaling were developed by Torgerson (1958) 
and Coombs (1964). 

Coughlan (1969) used pair-comparisons to study teachers’ work values. 
Sjogren, England, and Meltzer (1969) devised an instrument for assessing
the personal value-orientation of administrators. Gorlow and Noll (1967) 
developed an instrument for use with college students. Its forced-choice 
items were based on eight previously researched value dimensions. A number 
of scales designed to measure attitude and self-concept were described by 
Dowd and West (1969). 
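
The bookkeeping behind a pair-comparison instrument can be sketched briefly in Python. The sketch assumes the simplest possible score, the share of its comparisons that each item wins; it is not Coughlan's instrument, and it stops short of the classical scaling models treated by Guilford and Torgerson. The function name, the three work values, and the recorded choices are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def preference_scores(items, choices):
    """Return, for each item, the share of its comparisons that it won.

    `choices` is a list of (preferred, rejected) pairs pooled over respondents.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for preferred, rejected in choices:
        wins[preferred] += 1
        appearances[preferred] += 1
        appearances[rejected] += 1
    return {
        item: wins[item] / appearances[item] if appearances[item] else 0.0
        for item in items
    }

work_values = ["autonomy", "salary", "student growth"]

# Every pair of items is shown once to each respondent.
pairs_presented = list(combinations(work_values, 2))

# Hypothetical choices from two respondents (three pairs each).
choices = [
    ("student growth", "salary"), ("student growth", "autonomy"), ("autonomy", "salary"),
    ("student growth", "salary"), ("autonomy", "student growth"), ("salary", "autonomy"),
]

scores = preference_scores(work_values, choices)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, scores[item])
```

The second respondent's choices are deliberately intransitive; a pair-comparison record preserves such inconsistency instead of hiding it, which a single ranking could not do.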

Messick (1961) used multidimensional scaling to portray the political 
preferences of lay adults. The advantage of the multidimensional approach
is that dimensions (in this case, value orientations) do not have to be 
hypothesized in advance; the disadvantage of a multidimensional study 
is that the resulting dimensions are usually difficult to interpret and label. 

A special use of rating scales (including a special factor analysis) 
was developed by Stephenson (1953). He called it the Q-technique. Its 
most common component, the Q-sort, is briefly and nicely described by 
Nunnally (1959). Downey (1960) used a Q-sort, for which the respondent 
sorted 16 stimulus cards into a 1-2-3-4-3-2-1 distribution of frequencies,
to obtain priority values for global educational objectives. Stephenson
showed how such sortings could be factor analyzed ipsatively (correlating
persons instead of correlating scales) to get profiles for the individual
respondent.
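
The forced distribution and the idea of correlating persons can be made concrete with a short Python sketch. It assumes seven piles numbered 1 (least important) to 7 (most important) holding 1, 2, 3, 4, 3, 2, and 1 of the 16 cards, and it uses an ordinary Pearson correlation between two respondents' pile assignments. It does not carry out Stephenson's factor analysis, and the three sorts shown are hypothetical.

```python
from collections import Counter
from math import sqrt

# Pile number -> number of cards allowed in that pile (1-2-3-4-3-2-1).
FORCED_DISTRIBUTION = {1: 1, 2: 2, 3: 3, 4: 4, 5: 3, 6: 2, 7: 1}

def is_valid_sort(piles):
    """True if the 16 pile assignments follow the forced distribution."""
    return dict(Counter(piles)) == FORCED_DISTRIBUTION

def person_correlation(piles_a, piles_b):
    """Pearson correlation between two respondents across the same cards."""
    n = len(piles_a)
    mean_a = sum(piles_a) / n
    mean_b = sum(piles_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(piles_a, piles_b))
    ss_a = sum((a - mean_a) ** 2 for a in piles_a)
    ss_b = sum((b - mean_b) ** 2 for b in piles_b)
    return cov / sqrt(ss_a * ss_b)

# Hypothetical sorts: entry i is the pile a respondent chose for card i + 1.
respondent_a = [7, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1]
respondent_b = [6, 7, 5, 6, 5, 4, 5, 4, 4, 3, 4, 3, 2, 3, 2, 1]
respondent_c = list(reversed(respondent_a))  # the opposite ordering of A

for name, piles in [("a", respondent_a), ("b", respondent_b), ("c", respondent_c)]:
    print(name, is_valid_sort(piles))

r_ab = person_correlation(respondent_a, respondent_b)
r_ac = person_correlation(respondent_a, respondent_c)
print(r_ab, r_ac)
```

Because every respondent uses the same forced distribution, all sorts share one mean and one variance, so the correlation reflects only how similarly two people ordered the cards.
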

In a series of outstanding studies of educational values, Kerlinger
and his associates used the Q-technique and other factor analytic approaches. 
They found two basic independent attitudes: that toward progressivism 
and that toward traditionalism. They did not find these to be, as many 
would expect, opposite poles of a single continuum but independent factors. 
In other words, they found that knowledge of a person’s support for 
experimental projects, reconstruction, and life adjustment curricula (pro- 
gressivism) is not a sound basis for predicting his criticism of the school, 
his esteem for knowledge, or his educational conservatism (traditionalism). 
But things within one of these two clusters would predict others within 
that cluster (Kerlinger, 1967; Kerlinger and Pedhazur, 1968; Sontag, 1968). 

The semantic differential is a special scaling procedure for fixing 
meaning to objects and ideas. Developed by Osgood, Suci, and Tannenbaum 
(1957), the procedure enables the researcher to explore what people perceive 
to be the meaning of a concept. In education an evaluator may examine 
such concepts as “compensatory education,” “state aid to parochial schools,” 
“new math,” or even “my teacher.” When the search is for judgmental 
data, those concepts will be delineated by such descriptive scales as 
good-bad, needed-not needed, and useful-useless. Osgood and many re- 
searchers have used the semantic differential more to explore the dimensions 
of meaning people utilize than to explore the meaning of single concepts 
(see Snider and Osgood, 1969, for an excellent bibliography). For that 
goal, the factor analysis factors are more important than the specific concept 
descriptions. Since the evaluator is responsible for describing the value- 
background of a particular educational program, the semantic differential 
often yields descriptions more vague than he can use. Much time can 
be wasted trying to interpret new scales and new factors. The evaluator 
can probably make better use of the semantic differential by using scales 
from previous studies and by interpreting the results in terms of findings 
of those studies. 
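
A concept profile of the kind just described can be computed with very little machinery. The Python sketch below assumes seven-point ratings on the three descriptive scales named above, scored so that 7 marks the first-named pole (good, needed, useful); that scoring convention, the function name, and all ratings are assumptions made for illustration, and no factor analysis is attempted.

```python
from statistics import mean

# Bipolar scales named in the text; 7 = first-named pole, 1 = second-named pole.
SCALES = ("good-bad", "needed-not needed", "useful-useless")

def concept_profile(ratings):
    """Mean rating on each scale over all respondents who judged one concept."""
    return {scale: mean(r[scale] for r in ratings) for scale in SCALES}

# Hypothetical ratings from two respondents per concept.
ratings = {
    "new math": [
        {"good-bad": 6, "needed-not needed": 5, "useful-useless": 4},
        {"good-bad": 4, "needed-not needed": 3, "useful-useless": 2},
    ],
    "my teacher": [
        {"good-bad": 7, "needed-not needed": 7, "useful-useless": 6},
        {"good-bad": 5, "needed-not needed": 7, "useful-useless": 6},
    ],
}

profiles = {concept: concept_profile(rs) for concept, rs in ratings.items()}
for concept, profile in profiles.items():
    print(concept, profile)
```

Keeping to a small set of scales borrowed from earlier studies, as recommended above, means the resulting profiles can be read against published findings without interpreting new factors.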

Geis (1968) examined the usefulness of the semantic differential 
for evaluating a course-content-improvement project (Harvard Project 
Physics), but his interest was in measuring student understandings rather 
than perceptions of the curriculum. Wittrock, Wiley, and McNeil (1967)
used the technique (in a way consistent with the aim of this chapter) to 
examine what the concept “public school teachers” means. Harvey et al. 
(1968) related semantic differential data on teacher beliefs to classroom 
climate and student behavior. Taylor and Maguire (1967) used the semantic 
differential to study high school biology objectives. This last work is 
described in the section to follow, “Putting Objectives to the Test.”

Many project evaluators, e.g., Peckham (undated), are using the
semantic differential and other preference scales in project evaluation.
Though their findings may not be generalized beyond the confines of
their project, it is regrettable that their reports are not available for the 
guidance of other evaluators designing studies and selecting instruments. 
Some such reports can be found in Research in Education; but since the 
ERIC items are classified by what-was-studied and who-was-studied rather 
than by the instrumentation, the search is a taxing one for the methodologist. 


Additional References: Oppenheim (1966); William and Roberson (1967); 
Roper (1950); Wehling and Charters (1969). 


Observation and Expert Review 


How important things are to people can be pretty obvious. What 
they do longest and with most enthusiasm and what they work hardest to 
repair are what is important to them. The evaluator’s problem with many 
judgmental data is not in gathering them, but in gathering them in such 
a way that the report readers, those who are not there to observe, can 
understand what was seen. The readers need some basis for deciding what 
confidence to place in the report. The observer and the reviewer need 
protocols, guides, routines to alert themselves by predetermined schedule 
to important features so as to give their observations the replication upon 
which confidence can be based. An excellent example is the form Westphal 
and Boldt (1970) developed to record a college physics lecturer’s priorities 
in an objectives-by-audiences matrix. 

Helen Peak (1953) reviewed the problems of gathering observational 
data in the Festinger and Katz handbook, Research Methods in the Be- 
havioral Sciences. She emphasized the need for thorough training of 
the observers. Samph (undated) gathered evidence that the observer is 
likely to influence what happens in the classroom. More indirect ways of 
gathering observational data were suggested by Webb et al. (1966). 

A number of techniques (e.g., Flanders Interaction Analysis) were 
developed for observing the social and instructional interaction among 
teachers and students. These are discussed by Rosenshine in chapter 5 in 
this issue of the Review. By and large, the techniques do not spotlight 
the educational criteria and standards operating in the classroom; but the 
evaluator may find them illustrative for developing his own protocol. 

The shortage of procedures for making systematic observations of 
educational activities is particularly dismaying because the site visit is a 
widely used evaluation method. When a large-scale program is under 
way at some distant place, the most common way to evaluate it is to appoint 
a small number of respected persons to go there and inspect it. This method 
receives a proper share of criticism. It is evident that the program staff 
works hard to make the operation atypically handsome during the visit
and the visitors grasp at the slimmest shred of evidence for something to
report. Despite these defects, the method of site visits deserves its eminence
because it is designed for the most sensitive instruments available:
experienced and insightful men. Furthermore, it is capable of quick adaptation
to local circumstances. Its failings could be remedied by heeding Helen
Peak’s advice: train (even briefly) the visitors and provide a set of
standardized scales for directing visitor attention and for describing what
is seen.

Most site visitors for high school accreditation are at least indirectly 
guided by the National Study Evaluative Criteria (1969), a publication 
which probably directly guides the lengthy “Self-Study” which precedes 
many visits. This publication gives the visitor looking for judgmental data 
many important topics to consider but, as mentioned before, little basis 
for getting suitable data. 

Most site visitors to colleges in midwestern states are guided by the 
North Central Association’s Guide for the Evaluations of Institutions of 
Higher Learning (1965). This document draws attention to seven basic 
questions but leaves the scaling to the ingenuity of the visitor. 


1. What is the educational task of the institution? 

2. Are the necessary resources available for carrying out the task 
of the institution? 

3. Is the institution well organized for carrying out its institutional
task?

4. Are the programs of instruction adequate in kind and quality 
to serve the purposes of the institution? 

5. Are the institution’s policies and practices such as to foster 
high faculty morale? 

6. Is student life on campus relevant to the institution’s educa- 
tional task? 


Workers at the Educational Testing Service are developing some needed 
scales; one of them is called the Institutional Functioning Inventory (Peter- 
son, undated). 


Most site-visit evaluators dispatched by the U.S. Office of Education
to examine national laboratories and R & D centers are directed (USOE,
undated; Chase, 1969) by broad criterion questions to identify the extent
to which these units have become well managed and productive organiza- 
tions. Here, in the Tylerian tradition mentioned earlier, no one is encour- 
aged to discover whether or not the objectives are locally controversial 
or misstated or inconsistent with institutional philosophy or standards. 
A special-purpose site-visit plan (DESDEG) that does give such encour- 
agement was developed by Renzulli and Ward (1969) to guide the outside 
evaluator who is evaluating a program for gifted children. 


REVIEW OF EDUCATIONAL RESEARCH ; Vol. 40, No. 2 


For another type of scrutiny, documents and artifacts can be sent 
from the school or project to the experts. (See Welch and Walberg, 1968.) 
Expert review is an important evaluation technique for textbooks, achievement
tests, audio-visual materials, and instructional audio- and video-tapes.
Again the purposes of evaluation may be best served by some standard 
analytic device for highlighting objectives or standards. Perhaps the 
greatest criticism of the expert reviews of educational tests edited by Buros 
(1965) is that the reviewers are not guided by a checklist or common set of 
standards. This is not to say that reviewers are unaware of basic principles 
of testing as summarized in Lindquist (1951) or of such guides as the 
Bloom Taxonomy (1956) for identifying the purposes of test items. Knowl- 
edge of principles and possible purposes does little to assure that a reviewer 
will look at values and expectations of different groups of users. If value 
data are important, the reviewer must be coached to look at values. This 
is more likely to happen in the review of textbooks because Gordon (1967) 
and Morrissett and Stevens (1967) have provided outlines that direct 
attention to judgment data. For the review of tests and tapes the priorities- 
seeking evaluator is pretty much on his own. 

The thorough evaluator is tempted to analyze the documents of the 
community, the newspapers, and the minutes of meetings to learn how 
ideas and values have fared across time. Researchers call the technique 
content analysis. Berelson (1952) identified three situations for using 
content analysis: (1) when the researcher is curious about the contents 
themselves, (2) when he seeks inferences about the producers of the content, 
and (3) when he seeks to understand the audience that would use the 
content. Research on values and objectives fits nicely into both categories 
(2) and (3). Cahen (1970) modified a discomfort-relief quotient developed 
by Dollard and Mowrer (1947) to analyze the program of a professional 


Reitz (1967). In keeping with his purposes, however, the evaluator should 
recognize the differences between research methods and evaluation methods 
(Cartwright, 1953, 449-454; Stake, 1969). Unlike the researcher the 
evaluator is not obligated, nor does he have good opportunity, to generalize 
Programs or procedures. For describing the specific 
project, he may find some methods of the public-relations man appropriate; 
the writings of McCloskey (1959) and Kindred (1960) may be instructive. 
The best place within the educational-research literature for the
evaluator to find aids to analysis of casual writing is that dealing with 
essay grading. The outstanding work in this area has been done by 
Godshalk, Swineford, and Coffman (1966) and Gosling (1966). Here as 
with so many psychometric studies the focus of attention has been on 
writing abilities rather than on what is needed here—the identification 
of personal concern in discourse or narrative. 


Page is one researcher in this area who has realized the potential 
of the computer for discerning writing ability, writing style, content familiar- 
ity, and even personal concern. In his Project Essay Grade (1966) he 
found that his computer program could identify characteristics of student 
essays that correlated as well with English teacher gradings as the teachers’ 
marks correlated among themselves. The promise of this work, it seems, 
lies in giving educators indicators for monitoring routine teaching and 
learning, not in developing a basis for substituting for teacher judgments. 
The general potential of natural language processing was discussed by 
Stone et al. (1966). The day when computers could aid the values analysis 
of curricular materials seems to be neither here nor far away. 


Additional References: Payne (1969); Rosenshine (1969); Malinowski 
(1961). 


Using Judgment Data in Evaluation Projects 


In this section the uses of judgment data are examined. In the previous 
sections it was pointed out that judgment data are part of the context of 
education, that what is taught and what is learned are partly determined 
by personal preferences. An evaluator should realize that his audience can 
only poorly understand a program if they have no information on what 
is seen to be worth doing by those who are doing it. It is important to 
know what motivates Johnny, the learner; but for the program evaluator 
it is at least as important to know what motivates those who want Johnny 
to learn. Judgment data help show that the design of the program makes 
sense, or that it does not. Without judgment data an evaluator cannot 
show that a program is succeeding. 

The evaluator should consider not only how educational objectives 
manifest themselves in teaching and learning, but also how those objectives 
embody the aspirations and discontents of the people involved. It is sup- 
posed that such motives were taken into account when objectives were 
established, but it is also supposed that students learn what they are taught. 
Evaluation is twice needed. 
The development of sets of objectives has been thoroughly described
elsewhere (Mager, 1962; Suchman, 1967; Eisner, 1969; Baker, 1969). 
Herzog (1959) romanticized the definitional problem with the words: 


Big criteria have little criteria
upon their backs to bite ’em.
The small ones have still smaller,
and so on ad infinitum.


Krathwohl (1965) made a valuable contribution; he identified the stages 
of refinement of objectives and argued for performance objectives as the 
final product of the transformation. In the public health field Suchman 
(1967) illustrated a chain of objectives progressing from the most immediate 
practical objectives to the ultimate ideal goal. Taylor and Maguire (1966) 
presented the same ideas in a formal linear model, with societal press 
as the origin of objectives and criterion behavior as the outcome. In contrast 
to most followers of Tyler and Mager, Taylor and Maguire said that the 
origin of objectives should be the values of the people involved, not just 
the aims of the professional educators. The teacher serves an important 
function—as do the principal, board of education, textbook writer, and 
others—in translating national purposes and community needs into lesson 
plans. The evaluator should be able to give a clearer and more valid 
representation of community needs and generalized values. In the 
summative-evaluation sense and as an aid to planning and carrying out 
his study, the evaluator should display objectives against community value 
data to show what is congruent and what is not. 


Putting Objectives to the Test 


Few investigators examine empirically the relationship between values,
objectives, and priorities. Maguire and Taylor have. Maguire (1968) 
obtained teacher value-ratings of a heterogeneous set of objectives, then 
from the same teachers got different expressions of the priorities that 
should be given to these objectives. He found that the teachers perceived 
the objectives in at least four dimensions: Subject-Matter
Value, Motivational Qualities, Ease of Implementation, and Statement 
(Semantic) Properties. In another study, Taylor and Maguire (1967) 
obtained value-ratings of high school biology objectives from three im- 
portant groups: subject-matter experts, curriculum writers, and biology 
teachers. The investigators found substantial agreement in group view- 
point, with experts and teachers least alike. In an earlier study Taylor 
(1966) used various scaling techniques to find specific points of disagree- 
ment about topic priorities. He worked with such specific topical assignments 
as “the study of structure versus function” and “the study of the 
biological roots of behavior.” These studies are valuable, I believe, much
more because of the direct attack on the problems of measuring priority 
than for the guidance they give to the science educator. 


According to the linear model visualized by Taylor and Maguire 
(1966), educational objectives are an intermediate product of a logical 
operation. Although the model represented societal press as generalized 
personal behaviors, it is easier for most curriculum analysts to represent 
it as social values. Whichever way, such judgmental data can be presented 
in matrix form. The input to the process is a matrix of values, the output 
is a matrix of objectives. For each personal point of view there is an input 
values profile, a row of the input matrix. For each point of view there 
would be a profile of intended outcomes, a row of the output matrix. 
The entire data collection on values thus would be represented by a 
persons-by-value-position matrix, and the data collection on objectives 
would be represented by a persons-by-objectives matrix. The elements 
within the matrices would be numerical ratings or priorities. The similarity 
of the values matrix to the objectives matrix can be examined, by eye 
or mathematically. 
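
As a hedged illustration of examining the two matrices mathematically, the sketch below uses invented ratings for three hypothetical judges. Because the two matrices have different columns, it compares them indirectly: it computes the distance between every pair of judges within each matrix and then correlates the two sets of distances. The data, the function names, and the choice of Euclidean distance and Pearson correlation are assumptions made for illustration, not a procedure taken from Taylor and Maguire.

```python
from itertools import combinations
from math import sqrt

# Hypothetical ratings (1 = low priority, 5 = high priority).
# Rows are judges; columns are value positions or objectives.
values = {
    "teacher": [5, 2, 4],
    "parent":  [4, 1, 5],
    "expert":  [1, 5, 2],
}
objectives = {
    "teacher": [4, 4, 1, 2],
    "parent":  [5, 4, 2, 1],
    "expert":  [1, 2, 5, 5],
}

def distance(a, b):
    """Euclidean distance between two rating profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(matrix):
    """Distances between every pair of rows, in a fixed pair order."""
    pairs = combinations(sorted(matrix), 2)
    return [distance(matrix[p], matrix[q]) for p, q in pairs]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

value_d = pairwise_distances(values)
objective_d = pairwise_distances(objectives)
similarity = pearson(value_d, objective_d)
print(round(similarity, 3))
```

A correlation near 1 would suggest that judges who are close in their values are also close in their objectives; a value near 0 would suggest that the two kinds of judgment are organized differently.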


This matrix representation provides not only a way of displaying judgment
data but also a platform for considering the question, “By what transformation
(e.g., matrix algebra operations) do objectives derive from values?”
An educator probably responds to the value patterns of his people in
ways that reflect his awareness of (1) the needs of those who are to
benefit from education, (2) the resources available for education, and (3)
the probability that any given way of teaching would alleviate a need.
A researcher seeking to explain educator behavior may need to include
these three domains in his theory. And an evaluator seeking to aid an
educator may gather data from these three domains.*


It is reasonable to assume that the educator sets lower program
priorities on those things held in high value but for which he sees no
current need, and on those for which he sees a need but no sufficiently inexpensive
or potentially successful educational strategy. This suggests that what
an educator or any other person knows about student need, potentially 
successful pedagogy, and instructional resources may help in the analysis 
of the objectives he would emphasize for a school program. The evaluator 
who thinks along these lines quickly becomes reminded of the need for 
research findings on the functional relationships between various amounts 


*If an evaluator desires to find the main ways his judges or supervisors are reacting to 
such things as objectives or standards, he may use regression analyses. Maguire and Glass 
(1968) and Schenck and Naylor (1968) described how regression can be used to cate- 
of input and output as bases for educational decisions, a topic to be
discussed later. Cronbach and Gleser (1965) established a useful methodological
precedent in the field of personnel decisions.*
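
The reasoning about value, need, and feasible strategy can be made concrete with a small sketch. It assumes, purely for illustration, that an educator's program priority for an objective is the product of three ratings on a 0-to-1 scale: how highly the objective is valued, how much current need is seen, and how likely an affordable strategy is to succeed. The multiplicative rule, the names, and the numbers are hypothetical; the text claims only that low need or low feasibility lowers priority.

```python
# Hypothetical ratings on a 0-to-1 scale for three objectives.
ratings = {
    "critical reasoning": {"value": 0.9, "need": 0.8, "feasibility": 0.7},
    "penmanship":         {"value": 0.8, "need": 0.1, "feasibility": 0.9},
    "foreign travel":     {"value": 0.9, "need": 0.7, "feasibility": 0.1},
}

def program_priority(r):
    """Assumed rule: priority falls toward zero if any factor is low."""
    return r["value"] * r["need"] * r["feasibility"]

priorities = {name: program_priority(r) for name, r in ratings.items()}
ranked = sorted(priorities, key=priorities.get, reverse=True)

for name in ranked:
    print(f"{name:20s} {priorities[name]:.3f}")
```

Under this assumed rule, two highly valued objectives drop to the bottom of the list, one because little need is seen and the other because no promising strategy is available.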


In the administration of Title III programs for educational innovation 
and supplementary services, the U. S. Office of Education has paid increas- 
ing attention to “Needs Assessment.” It has mandated state-by-state 
studies. Some efforts to satisfy this requirement are being made as if need 
were a non-reactive characteristic (such as age, geographic location, or 
loudness) and not dependent on who is being asked to describe it. To 
describe need is not only to describe the something but to describe the 
persons asked. People see need differently. The evaluator has no obligation 
to find consensus. In fact, he is acting improperly if he does not report 
the diversity of viewpoints of need (McLure, 1968). The developers of 
a state plan do have to find a compromise. Pennsylvania now has its 
plan (Educational Testing Service, 1965). A reader specifically interested 
in Title III might use the needs-assessment plan for Florida as a model.
(Florida Educational Research and Development Council, 1968.) Popham 
(1969) recognized the important need in Needs Assessment for empirical 
data on preferences. He equated critical needs with discrepancies between 
(1) preference of items from the UCLA Objectives Exchange and (2) 
performance levels on still-to-be-constructed-or-selected National-Assess- 
ment-like criterion-reference achievement items. (See Popham and Skager, 
1968; Department of Elementary School Principals, 1967.) 
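
Popham's definition of critical need as a discrepancy lends itself to a simple computation. The sketch below assumes that preference and performance have both been expressed on a common 0-to-100 scale, which is an assumption of this example and not something the sources specify; the objective names and figures are invented.

```python
# Hypothetical data: mean preference rating and mean performance level,
# both rescaled to 0-100 so that they can be subtracted.
preference = {"reading comprehension": 90, "map skills": 55, "fractions": 80}
performance = {"reading comprehension": 60, "map skills": 50, "fractions": 70}

def critical_needs(pref, perf):
    """Return (objective, discrepancy) pairs, largest shortfall first."""
    gaps = {name: pref[name] - perf[name] for name in pref}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

needs = critical_needs(preference, performance)
for name, gap in needs:
    print(f"{name:25s} {gap:+d}")
```

Because need depends on who is asked, the same computation could be repeated separately for each group of respondents, so that the diversity of viewpoints is reported and not averaged away.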

Dorothy Fraser (1963, p. 105), speaking for the National Committee
of the NEA Project on Instruction, recognized the criticality of priorities 
but avoided the educational-technologist procedural question saying, “There 
is no set of specifications for a balanced curriculum which can be applied 
to every school in the United States, just as there is no uniform prescription 
to determine what should be included and excluded from the school 
program.” The idea that how-to-translate-local-objectives-into-local-pro- 
gram is only a local matter is unacceptable. The National Committee did 
review the important subject-matter areas; it should also have addressed 
itself to the questions of “How can the profession help the community see 
and adjust the ‘balance’ in its curriculum?” 

Such techniques as PERT (Cook, 1966) or the curriculum-development 
models of Gagné (1967) or Hively, Patterson, and Page (1968) are helpful 
in ordering what to do first and what to do second; but only in the hands
of a skilled manager will they be helpful to the educator trying to give
specific priorities to different objectives. 


*I do not imply that the best way to generate objectives is to consider values and needs 


ee on objectives may be better understood if data on values and 
Additional References: Ammons (1964); Becker (1950); Wharton (undated);
Baker (undated); Popham and Baker (1965).


Methods of Reporting Judgment Data 


Like much educational data, value preferences are difficult to summarize 
because their dimensions—although part of common thought and dis- 
course—are defined and scaled differently by different people. Two people 
often do not mean the same thing when they say, “an emphasis on
critical reasoning” or “a substantial rise in morale.” Furthermore, most 
readers resist technical definitions and standardizations that do not fit 
their own conceptualizations. To get his message across, the evaluation 
reporter must insist that the reader study the instruments and procedures 
used, that he note the language of individual items or classifications, and 
that he appreciate the conditions in which the data were gathered. There- 
fore, all this information must be available to the reader. These constructs 
and conditions are important as background, but they do not necessarily 
identify any causes of success or failure. They are valuable as a safeguard 
against unwarranted generalization by the reader. They help him establish
limits. 

The evaluator’s scales and checklist items are often quite general; 
effects vary little with different wording or administration. Sometimes there 
are differential effects. This background information is needed because 
the report reader needs to be reminded that it may or may not be appro- 
priate for him to relate new findings to a particular knowledge he has. 
The report cannot authorize him to generalize, but it can stimulate his 
thinking about possibilities and limits to generalization. Though true of 
any evaluation data, this is especially true of judgment data. 

Many judgment data will be reported in narrative form. Havighurst
(1964); Cate (1966); and Hunt, Hardt, and Victor (1968) provided good
models. Verbatim quotations are widely used (e.g., Gooler, 1969) to summarize
participant critiques of workshops and institutes.
Unlike research reporting, evaluation reporting is communicating with 
an audience that does not share the technical qualifications and special 
interests of the investigator. Textbooks on research methods are of little 
value for this responsibility. The difficulties of describing education to 
a lay audience were discussed by McCloskey (1959). Schramm (1954) 
identified four principles of message construction. Doob (1948) pointed 
out the good and bad contributions of photographs, drawings, and cartoons.
Modley and Lowenstein (1952) discussed the development of graphs.
Such readings are valuable but neglected guides for the evaluator reporting 
complex judgment data. 
Educational-measurement specialists and education critics (often combatants)
jointly contribute to the communication problem by demanding
and developing more locally relevant and currently relevant measures
when previously used measures would be more easily interpreted. It is
regrettable that the items carefully developed by Downey (1960), Greeley 
and Rossi (1966), and Coleman et al. (1966) are not used repeatedly 
so that cross-study meanings can develop. Kurland (1966) and Tyler 
(1966) have spoken eloquently about the need for standardized indicators
of educational performance. 


A profile may be the best way to present data when a number of
distinctive value-statements are used to describe an individual, a product,
or a program. (See Steele, 1969b, for an example of a profile of classroom
emphases derived from the Bloom Taxonomy.) Care must be taken that
central tendency and variance are equivalent from scale to scale. The
comparison of profiles is a deceptively difficult task, as Helmstadter (1957) 
pointed out. 
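
One common way to make central tendency and variance equivalent across scales is to convert each scale to standard scores before drawing the profile. The sketch below does this for invented ratings; the scale names and data are hypothetical, and standard scores are only one of several possible adjustments.

```python
from statistics import mean, pstdev

# Hypothetical raw ratings from five judges on two differently scaled items.
raw = {
    "emphasis on recall (1-5)":      [2, 3, 3, 4, 5],
    "emphasis on synthesis (0-100)": [40, 55, 60, 70, 95],
}

def standardize(scores):
    """Convert scores to z-scores: mean 0 and standard deviation 1."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

z = {scale: standardize(scores) for scale, scores in raw.items()}

# The profile for one judge is that judge's z-score on every scale.
judge_index = 4
profile = {scale: round(z[scale][judge_index], 2) for scale in z}
print(profile)
```

After standardization a point on the profile shows how far a judge sits from the group average on that scale, so a 1-to-5 item and a 0-to-100 item can be read side by side.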

The task of relating certain things to geographic location or to time 
is relatively easy. Numbers of symbols superimposed on a map and trend 
lines (changes in amount over time) are among the easiest graphics to 
understand. Most researchers try to put subtle relationships into graphic 
form and massive information into tabular form, but often the non- 
specialist reader will find them too subtle or too detailed. Even with moun- 
tains of data to report, Coleman (1966) maintained a high degree of 
readability. In one instance (p. 199) he described attitude (e.g., “People 
like me don’t have much of a chance to be successful in life”) differences 
among racial groups with tables showing percents of students who agreed 
and disagreed with the item. Downey (1960) used a symmetric matrix to 
obtain an easily read display of which pairs of groups of people emphasized 
different educational objectives. Time and Fortune magazines often include 
comprehensible and interesting formats for the discussion of complex 
problems. In evaluating New York City’s More Effective School Program, 
Fox et al. (1968) displayed in tables of proportions the distribution of 
parents perceiving the MES schools as better or worse than specific other 
schools. Further, they reported data on whom their respondents thought 
should participate in various school and classroom decisions. 

When descriptors are too complex to put into profile form, matrices 
(including correlation and factor matrices) may be useful. Programs for 
analyzing and simplifying these matrices are many and varied. A suitable 
introductory textbook or handbook is not available, but “multivariate 
analysis” specialists are available as consultants. Tucker (1969) extended 
the multidimensional scaling approach to identify individual person spaces 
in addition to the common object space. Maguire (1968) used a technique 
developed by Schönemann (1966) to determine the similarity of value
configurations. 


Additional References: Ford (1969); Schenck and Naylor (1968); Hovland 
(1959); Kochen (1967). 


Decision Making 


Judgment data enter into decision processes as inputs, not as outputs. 
Many decisions about running a program or maintaining a curriculum 
are made intuitively. The decisions per se are not judgment data. The data 
discussed in this chapter are those gathered as subjective information about 
the context and conduct of a program, information that may or may not 
enhance decision making. 

One major way of categorizing evaluation efforts is to distinguish 
between those that directly lead to administrative decisions about a particu- 
lar program and those that contribute to broad understanding of programs 
and only indirectly lead to corrective decisions about a particular program. 
Cronbach and Suppes (1969, pp. 20, 21) called the one decision-oriented 
research and the other conclusion-oriented. 


In a decision-oriented study the investigator is asked to provide 
information wanted by a decision-maker: a school administrator, 
a governmental policy-maker, the manager of a project to develop 
a new biology textbook, or the like. The decision-oriented study 
is a commissioned study. The decision-maker believes that he 
needs information to guide his actions and he poses the question 
to the investigator. The conclusion-oriented study, on the other 
hand, takes its direction from the investigator’s commitments and 
hunches. The educational decision-maker can, at most, arouse the 
investigator’s interest in a problem. The latter formulates his own 
question, usually a general one rather than a question about a 
particular institution. The aim is to conceptualize and understand 
the chosen phenomenon; a particular finding is only a means to 
that end. Therefore, he concentrates on persons and settings that 
he expects to be enlightening. 


Glass (1969) emphasized that designs for evaluation studies need to be 
influenced by judgments of the worth of the program as a socially bene- 
ficial activity. These decision-oriented evaluation studies will also vary 
in terms of how oriented they are to the specific decisions facing the 
curriculum developer, purchaser, or voter. 

Even in the more prescriptive and systems-analytic plans (Guba 
and Stufflebeam, 1968; Alkin, 1969) for evaluation, it is expected that 
there will be a person or a group of persons who will examine the evidence,
consider its credibility, question its relevance, assess its completeness, and
then make decisions. No serious proposal has been made to pre-specify
and quantify all these considerations so the decision could be completely
automatic. One expects decision-makers to consider all the evaluation-study
data in the light of their professional experience. Some judgmental data
may tax their resolve to be rational and deliberate, but the ways of handling
judgmental-data decisions are no different from ways of handling other
kinds.


A common assumption among evaluation users is that, if there is 
a discrepancy between an intended situation and an actual situation, the 
decision should change the treatment and eventually correct the actual 
situation. Lindvall and Cox (in press), for example, endorsed that assumption
in their evaluation of Individually Prescribed Instruction. The Provus
(1969) “Discrepancy Model” makes a contribution by acknowledging the
alternative procedure, that of adjusting the intended situation (i.e., the
objectives) as well as the treatment when discrepancies are found.


Another important assumption regarding decision making is that the 
larger the discrepancy, the more attention the decision should command.
That need not hold. A school district’s fifth graders may be .5 grade equivalents behind an
accepted norm in arithmetic and 1.5 grade equivalents behind an accepted
norm in spelling, yet the arithmetic decision may be the more important.
Community pressures, the difficulties and costs of remediation, the relevance 
of the testing instruments, and many other circumstances are pertinent 
to the importance of a decision. The priorities given to different discrepancies 
are additional data—data which can be obtained by the techniques dis- 
cussed in this chapter. 


Voicing a third assumption that most education technologists make, 
Maguire (1969) claimed, “It is important that the rationale for decisions 
at various levels be made explicit, so that conflicts [between curriculum 
developers and teaching personnel] can be resolved, thus maximizing the 
chance of achieving the broad objectives that are deemed significant by 
the society.” Explication as a development-tactic is still a debatable issue 
and one susceptible to empirical research (Marquis, 1969). Conflicts can 
be resolved without knowledge of the opponents’ rationales (and the 
maximum effort toward society’s goals may be stimulated more by conflict 
than conflict resolution). Educational technologists have been too reluc- 
tant to admit that intuitive decision making and irrational conflict resolu- 
tion often work very well. That the technologists themselves find precise 
definitions, quantification, and graphic presentations to be valuable aids- 
to-thinking may falsely persuade them that they are aids-to-all-thinking. 
The precious commodity that the technologists offer here is an emphasis 
on arranging things so new information will emerge. This feedback may
assist in the recognition of a problem or conflict and may assist in the 
correction of it (Lindbloom, 1969). It should be noted that feedback can 
happen without words. 

The political scientist Pool (1968) predicted that foreign policy
problems will be worked out by national leaders sitting at computer 
consoles, indicating their likes and dislikes, preferences for this and that 
solution, desire for increase and decrease of investment, etc. The Rand 
Corporation developed a technique called the Delphi Technique for just 
such conflict resolution (Helmer, 1966). Experts, authorities, or spokesmen 
of another sort are repeatedly presented new data on the current situation 
and the preferences of others; they are asked to make new observations, 
preference statements, or requests for data. The computer need not be a 
part of this procedure; but it can be a great aid, particularly if a large 
amount of data is available and if the human participants cannot all 
be present at the same time. Working on computerized policy-research 
games is a team headed by Charles Osgood at the University of Illinois 
Institute for Communications Research. One of its members, Stuart
Umpleby (1969, pp. 10, 11) said, 


The most obvious problem is how people can communicate in 
a game situation without recognizing that they are dealing with 
a machine and not a human being. . . . In a PLATO version . . .
this would be done by typing messages on one’s own screen and 
the screen of a preselected player. If a computer-simulated player 
were to ‘understand’ messages on nearly open-ended subject 
matter, an extraordinarily complex content analysis routine would 
have to be written which would be very time consuming if not 
beyond the current state of the art. 


The alternative is to have players communicate with each other 
by multiple choice branching. The unfortunate disadvantage of 
using multiple choice branching is that human players are reminded
of all the alternatives open to them. . . .


Here, as in so many situations described in educational-research publica- 
tions, the emphasis of the writer-researcher is on how people interact. 
But it is obvious to Umpleby and many others that these interaction 
techniques are endorsed by participants as having a practical value which 
enables them to come to grips with problems. On rare occasions, the
interaction even seems to elicit an original solution. 
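
The repeated-feedback cycle of the Delphi Technique described above can be sketched in a few lines. The simulation assumes that after each round every participant is shown the group median and moves a fixed fraction of the way toward it. That behavioral rule, the `pull` parameter, and the starting estimates are hypothetical simplifications introduced here; they are not part of Helmer's description.

```python
from statistics import median

def delphi_round(estimates, pull=0.5):
    """One feedback round: each participant sees the group median and
    moves part of the way toward it (an assumed behavior)."""
    m = median(estimates)
    return [e + pull * (m - e) for e in estimates]

def spread(estimates):
    """Range of the current estimates, a crude index of disagreement."""
    return max(estimates) - min(estimates)

estimates = [10.0, 20.0, 25.0, 60.0]  # hypothetical initial judgments
history = [estimates]
for _ in range(3):
    estimates = delphi_round(estimates)
    history.append(estimates)

for round_number, est in enumerate(history):
    print(round_number, [round(e, 2) for e in est], round(spread(est), 2))
```

In a real application the participants, not a formula, decide how far to revise their positions, and they may also ask for new data between rounds.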


Additional References: Scheffler (1958); Taylor (1966); Dershimer (1968);
Freeman et al. (1956).
Summary


Robert F. Kennedy recognized the inadequacy of present-day value 
indicators. He was quoted by Ross (1969, p. 351):


We cannot measure national spirit by the Dow-Jones average or 
national achievement by the gross national product. For the gross 
national product includes our pollution and advertising for ciga- 
rettes, and ambulances to clear our highways of carnage. It counts 
special locks for our doors and jails for the people who break 
them. The gross national product includes the destruction of the 
redwoods, and the death of Lake Superior. It grows with the 
production of napalm and missiles and nuclear warheads. . . . 


There are many who believe that efforts to represent values in words and 
numbers are more hurtful than helpful, more a part of the problem 
than a solution to it. More power to them; even if they are wrong, they
help to keep the quantifiers honest. Some evaluators, following their own 
values and exercising what little muscle they have, will try to measure 
things better and communicate them more clearly. 

Evaluators need to be specific about what they are doing, but they 
also have to be alert to things whose relevance is only slowly emerging.
Their plans need to be specific enough to show what they want most to 
discover and communicate—so that there is a basis for evaluating the 
evaluation—but open enough to report the unexpected. 


A general conclusion which seems to me to emerge from a 
historical approach—the examination of a number of research 
case histories—is that mankind consistently errs in the direction 
of lack of foresight and imagination. We continually under- 
estimate the power of science and technology in the long term. 
Eminently knowledgeable planners and scientists, in attempting 
responsibly to make realistic appraisals of research, and facing 
what is at the time uncertain or unknown, all too frequently fall 
short in foresight and imagination. The element of surprise is a
consistent ingredient in technological development, and one we 
have great difficulty in dealing with on any normal planning 
basis. . . . (Townes, 1968, p. 699.)


Maybe it would have been more appropriate in the opening paragraph 
if it had said that this review is a ade paraa wila Gd e a 
are not doing with judgment data. Few procedures have been cited that 
have been used successfully (or even tried) for making judgment data 
a part of the evaluation story. Excuses are many: “. . . too much to do.”
“Judgment data are messy.” “What people think doesn’t correlate with
student gain any way.” But none of these excuses is adequate. It does 
not matter that the job is difficult. It does not matter that evaluators seldom 
find strong correlations (see Jewell, 1969; Winter, 1965) between back- 
ground conditions—including aims, needs, and standards—and educational 
outcomes. Evaluation audiences do believe that to understand education 
one needs to understand what people expect from education. As long as 
that is true, evaluators have an obligation to make a careful search for 
objectives, standards, and other judgment data. 
2: POLITICS AND RESEARCH:
EVALUATION OF SOCIAL ACTION PROGRAMS IN EDUCATION


DAVID K. COHEN* 
Harvard University 


Although program evaluation is no novelty 
in education, its objects have changed radically. The national thrust 
against poverty and discrimination introduced a new phenomenon with 
which evaluators must deal: large-scale programs of social action in 
education. In addition to generating much activity in city schools, these 
programs produced considerable confusion whenever efforts were made to 
find out whether they were “working.” The sources of the confusion 
are not hard to identify. Prior to 1964, the objects of evaluation in educa- 
tion consisted almost exclusively of small programs concerned with such 
things as curriculum development or teacher training: they generally 
occurred in a single school or school district, they sought to produce 
educational change on a limited scale, and they typically involved modest 
budgets and small research staffs. 

This all began to change in the mid-1960’s, when the federal govern- 
ment and some states established broad educational improvement programs. 
The programs—such as Project Headstart, Title I of the 1965 ESEA, and 
Project Follow-Through—differ from the traditional objects of educational 
evaluation in several important respects: (1) they are social action pro- 
grams, and as such are not focused narrowly on teachers’ in-service 
training or on a science curriculum, but aim broadly at improving 
education for the disadvantaged; (2) the new programs are directed not 
at a school or a school district, but at millions of children, in thousands 
of schools in hundreds of school jurisdictions in all the states; (3) they are 
not conceived and executed by a teacher, a principal, a superintendent, or a
researcher—they were created by the Congress and are administered by 
federal agencies far from the school districts which actually design and 
conduct the individual projects. 


*Research for this paper was supported by a grant from the Carnegie Corporation
of New York to the Center for Educational Policy Research, Harvard University. Henry
Dyer, Frederick Mosteller, and Martin Rein served as consultants to Dr. Cohen on the
preparation of this chapter.
Simply to recite these differences is to suggest major new evaluation
problems. How does one know when a program which reaches more 
than eight million children “works”? How does one even decide what 
“working” means in the context of such large-scale social action ventures? 
Difficulties also arise from efforts to apply the inherited stock-in-trade
evaluation techniques to the new phenomena. If the programs seek broad
social change, is it sensible to evaluate them mainly in terms of achievement?
If they are national action programs, should evaluation be decentralized?


This chapter is an effort to explore these and other questions about 
evaluating large-scale social action programs. It has three major parts. 
First, I delineate the political character of the new programs, in order 
to distinguish them from the traditional objects of educational evaluation. 
Second, to illustrate this point and define the major obstacles to evaluation, 
I review some evaluations of the new programs. Finally, I suggest some 
elements of a strategy which might improve the evaluation of social 
action programs. 


Politics and Evaluation 


There is one sense in which any educational evaluation ought to
be regarded as political. Evaluation is a mechanism with which the
[illegible] of an educational enterprise can be explored and expressed.
Such enterprises are managed by people, and they take place in institutions;
therefore, any judgment on their nature or results has at least a
potential political impact—it can contribute to changing power relationships.
This is true whether the evaluation concerns a small curriculum
program in a rural school (if the program is judged ineffective
its director might lose influence or be demoted), or a teacher training
program in a university (if it is judged a success its sponsors might get
[illegible]). Evaluation, as some recent commentators have pointed
out, produces information which is at least potentially relevant to decision-making
([illegible], 1967; Guba, 1968). Decision-making, of course,
is a euphemism for the allocation of resources—money, position, authority,
etc. Thus, to the extent that information is an instrument, basis, or excuse
for changing power relationships within or among institutions, evaluation
is a political activity.

These political aspects of evaluation are not peculiar to social action
programs. They do, however, assume more obvious importance as an
educational program grows in size and number of jurisdictions covered:
the bigger it is, the greater the likelihood for the overt appearance of
political competition.

There is another sense in which evaluation is political, for some
programs explicitly aim to redistribute resources or power; although this 
includes such things as school consolidation, social action programs are 
the best recent example. They were established by a political institution 
(the Congress) as part of an effort to change the operating priorities 
of state and local governments and thus to change not only the balance 
of power within American education but also the relative status of economic 
and racial groups within the society. One important feature of the new 
social action programs, then, is their political origin; another is their 
embodiment of social and political priorities which reach beyond the 
schools; a third is that their success would have many far-reaching political 
consequences. 


action programs, however, the political importance of information is 
raised to a high level by the broader political character of the programs 
themselves.

This should be no surprise. Information assumes political importance 
within local school jurisdictions, but political competition among school
jurisdictions usually involves higher stakes—and the social action programs 
promote competition among levels of government. These programs are 
almost always sponsored by state or federal government, with at least 
the implicit or partial intent of setting new priorities for state and local 
governments. In this situation evaluation becomes a political instrument, 
a means to determine whether the new priorities are being met and to 
assess the differential effectiveness of jurisdictions or schools in meeting them. 
As a result, evaluation is affected by the prior character of intergovern- 
mental relations. State resistance to federal involvement, for example, 
pre-dates recent efforts to evaluate and assess federal social action programs. 
The history colors the evaluation issue and the state response reflects the 
prior pattern of relations, for evaluation is correctly seen as an effort to 
assert federal priorities. Evaluation also can affect patterns of intergovernmental
relations, for it can help consolidate new authority for the superordinate
government. In general, however, evaluation seems to reflect
the established pattern of intergovernmental relations.

Of course, not all the novelties in evaluating large-scale social action 
programs are political. There are serious logistical difficulties—the programs
are bigger than anything ever evaluated in education, which poses unique 
problems—and there is no dearth of methodological issues. These mostly 
center around making satisfactory comparisons between “treated” subjects 
and some criterion presumed to measure an otherwise comparable condi- 
tion of non-treatment. These are difficult in any program with multiple 
criterion variables, and when the program is spread over the entire country 
the problems multiply enormously. But difficult though these issues may 
be, they are in all formal respects the same, irrespective of the size, age, 
aim, or outcome of the program in question. The large-scale programs 
do not differ in some formal property of the control-comparison problem, 
but only in its size. The bigger and more complicated the programs, the 
bigger the associated methodological headaches. What distinguishes the
new programs is not the formal problems of knowing their effects, but
the character of their aims and their organization. These are essentially
political.


The politics of social action programs produce two sorts of evaluation 
problem. Some are conceptual—the programs’ nature and aims have not 
been well understood or adequately expressed in evaluation design. Others 
are practical—the interested parties do not agree on the ordering of priorities 
which the programs embody. As a result of the first, evaluation is misconceived;
as a result of the second, evaluation becomes a focus for expressing
conflicting political interests.


Conceptual Problems 


The central conceptual difficulty can be simply summarized: while the 
new programs seek to bring about political and social change, evaluators 
generally approach them as though they were standard efforts to produce 
educational change. This results in no small part from ambiguity of the
programs—since they are political endeavors in education, the program
content and much of the surrounding rhetoric is educational. It also occurs
because evaluation researchers identify professionally and intellectually
with their disciplines of origin (mostly education and psychology), and thus
would rather not study politics. They prefer education and psychology;
that is what they know, what their colleagues understand, and—if
done well—what will bring them distinction and prestige (Dentler, 1969).

But whatever the sources of the incongruence, it produces inappropriate 
evaluation. The aims and character of the programs are misconceived, 
and as a result evaluation design and execution are of limited value. Title 
I of ESEA (U. S. Congress, 1965a) is a good example with which to begin. 

In the four years ESEA has been in existence, the federal government 
completed several special evaluation studies, undertaken either by the 
Office of the Secretary of HEW or the Office of Education.* They concentrated
mainly on one question—has the program improved achievement
over what otherwise might have been expected? The answer in each case 
was almost entirely negative, and not surprisingly, this led many to conclude 
that the Title I program was not “working.” This, in turn, raised or 
supported doubts about the efficacy of the legislation or the utility of com- 
pensatory education. Yet such inferences are sensible only if two crucial 
assumptions are accepted: 


(a) children’s achievement test scores are a sufficient criterion of 
the program’s aims—the consequences intended by the govern- 
ment—to stand as an adequate summary measure of its 
success; and 

(b) the Title I program is sufficiently coherent and unified to warrant 
the application of any summary criterion of success, be it achieve- 
ment or something else. 


Both assumptions merit inspection. 


It does not seem unreasonable to assume that improving the achieve- 
ment of disadvantaged children is a crucial aim of the Title I program. 
Much of the program’s rhetoric suggests that it seeks to reduce the high 
probability of school failure associated with poverty. Many educators and 
laymen regard achievement test scores as a suitable measure of school 
success, on the theory that children with higher achievement will have 
higher grades, happier teachers, more positive attitudes toward school, and 
therefore a better chance of remaining and succeeding. 

There are, however, two difficulties with this view. One is that 
achievement scores are not an adequate summary of the legislation’s diverse 
aims. The other is that hardly anyone cares about the test scores themselves 
—they are regarded as a suitable measure of program success only because 
they are believed to stand for other things. 

The second point can easily be illustrated. Aside from a few intellec- 
tuals who think that schooling is a good thing in itself, people think test 
scores are important because they are thought to signify more knowledge,
which will lead to more years in school, better job opportunities, more
money, and more of the ensuing social and economic status Americans
seem to enjoy. Poor people, they reason, have little money, undesirable
jobs (if any) and, by definition, the lowest social and economic status
presently available. The poor also have less education than most of their


*The national evaluations of Title I are little more than annual reports based on the
state evaluation reports, which are little more than compilations of LEA reports. This
is not to say that the reports are useless—but simply that they are not evaluations.
The Office of the HEW Secretary was responsible for a study by Tempo, 1968. Also,
see Piccariello, undated.

countrymen. On the popular assumptions just described, it is easy to argue
that “poverty can be eliminated” by increasing the efficiency of education 
for the poor. 

Although much abbreviated, this chain of reasoning is not a bad state- 
ment of the reasons why improved achievement is an aim of Title I. 
Improved schooling was a major anti-poverty strategy, and higher school 
achievement simply a proxy for one of the program’s main aims— 
improving adults’ social and economic status. The principal problem this 
raises for evaluation is that the criterion of program effectiveness is 
actually only a surrogate for the true criterion. This would pose no difficulty 
if reliable estimates of the causal relationship between schoolchildren’s 
achievement and their later social and economic status existed. Unfor- 
tunately, no information of this sort seems to be available. There is one 
major study relating years of school completed and occupational status; 
it shows that once inherited status is controlled, years of school completed 
are moderately related to adult occupational status (Blau and Duncan, 
1968). Other studies reveal no direct relationship between intelligence and 
occupational status, but they do show that the education-occupation relationship
is much weaker for Negroes than for whites (Duncan, 1968). The
first of these findings should not encourage advocates of improved achieve- 
ment, and the second is hardly encouraging to those who perceive blacks 
as a major target group for anti-poverty programs. 

There are studies which show that more intelligent people stay in 
school longer (Duncan, 1968), but it is hardly clear a priori that raising 
achievement for disadvantaged children will keep them in school, nor is 
it self-evident that keeping poor children in school longer will get them 
better jobs.* It is, for example, not difficult to imagine that the more 
intelligent children who stay in school longer do so because they also 
have learned different behavior patterns, which include greater tolerance 
for delayed gratification, more docility, less overt aggression, and greater 
persistence. Several compensatory programs are premised on these notions, 
rather than the achievement-production idea. Without any direct evidence 
on the consequence of either approach, however, it is difficult to find 
a rational basis for choice.

This does not mean that compensatory education programs founded 
on either view are a mistake—absent any data, one could hardly take 
that position. It does suggest, however, that using achievement—or any 
other form of school behavior—as a proxy for the actual long-range 
purpose of compensatory education is probably ill-founded. The chief 
difficulty with this variety of agnosticism, of course, is that the only 

*By “achievement,” I mean measures of reading or general verbal ability; I do not
include therein more specialized measures of achievement such as math, social studies,
science, or driver education.

alternative is evaluation studies whose duration would make them of
interest only to the next generation. What is more, they would be extremely 
expensive. The current proportion of program budgets devoted to evalua- 
tion indicates that the probability of undertaking such studies is nil. 

Even if this scientific embarrassment were put aside, there is the other 
major difficulty with using achievement as an evaluation criterion. School- 
men must be expected to assume that the greater application of their efforts 
will improve students’ later lives, but there is no evidence that the Congress 
subscribed to that view by passing Title I of the 1965 ESEA. Although the 
title did contain an unprecedented mandate for program evaluation—and 
even specified success in school as a criterion—this is scant evidence that 
the sole program aim was school achievement. The mandate for evaluation 
—like many Congressional authorizations—lacked any enabling mechanism: 
responsibility for carrying out the evaluation was specifically delegated to 
the state and local education authorities who operated the programs. It 
was not hard to see, in 1965, that this was equivalent to abandoning much 
hope of useful program evaluation. 

The main point, however, is that the purposes of the legislation were 
much more complex: most of them could be satisfied without any evidence 
about children’s achievement. Certainly this was true for aid to parochial 
school students, and it most likely was also true for many of the poorer 
school districts: for them (as for many of the congressmen who voted for 
the act) more money was good in itself. Moreover, the Congress is typically 
of two minds on the matter of program evaluation in education—it sub- 
scribes to efficiency, but it does not believe in Federal control of the schools. 
National evaluations are regarded as a major step toward Federal control 
by many people, including some members of Congress. 

Although the purposes of the Congress may be too complicated to 
be summarized in studies of test scores, they are not by that token 
mysterious. The relevant Committee hearings and debates suggest that 
the legislative intent included several elements other than those already 
mentioned.* One involved the rising political conflict over city schools 
in the early 1960’s; many legislators felt that spreading money on troubled 
waters might bring peace. Another concerned an older effort to provide 
federal financial assistance for public education: the motives for this were 
mainly political and ideological, and were not intimately tied to achieve- 
ment. A third involved the larger cities; although not poor when compared 
to the national average expenditure, they were increasingly hard-pressed to 
maintain educational services which were competitive with other districts 
in their areas as property values declined, population changed, and costs 


*A good general account is Bailey and Mosher, 1968. See also: U. S. Congress, 1965b;
U. S. House Education and Labor Committee, 1965a and b; U. S. Senate Appropriations
Committee, 1965; and U. S. Senate Labor and Public Welfare Committee, [year illegible].

and taxes rose. Educators and other municipal officials were among the
warmest friends of the new aid scheme, because it promised to relieve some 
of the pressure on their revenues. 


Indeed, many purposes of the legislation—and the Congress’s implicit 
attitude toward evaluation—can be summarized in the form which it gave 
to fund apportionment. Title I is a formula grant, in which the amount of 
money flowing to any educational agency is a function of how many poor 
children it has, not of how well it educates them. In a sense, Title I is the 
educational equivalent of a rivers and harbors bill. There is no provision 
for withdrawing funds for non-performance, nor is there much suggestion 
of such intent in the original committee hearings or floor debates. Given 
the formula grant system, neither the Federal funding agency nor the states
have much political room to maneuver, even if they have the results of 
superb evaluation in hand. Without the authority to manipulate funds, 
achievement evaluation results could only be used to coax and cajole 
localities: the one major implicit purpose of program evaluation—more 
rational resource allocation—is seriously weakened by the Title I formula 
grant system. 
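
The logic of a formula grant can be sketched in a few lines of Python. This is an illustration only: the districts, the achievement gains, and the per-child rate are hypothetical, and the function is a simplification rather than the statutory Title I formula. The point it shows is the one made above, that the grant depends on the count of poor children and not on any measure of performance.

```python
def formula_grant(poor_children, dollars_per_child):
    """Return a grant that depends only on the number of poor children."""
    return poor_children * dollars_per_child


# Hypothetical districts: (name, number of poor children, mean achievement gain).
districts = [
    ("District A", 10_000, 0.0),
    ("District B", 10_000, 5.0),
    ("District C", 2_500, 5.0),
]

RATE = 100  # hypothetical dollars per poor child, not the statutory figure

# The achievement gain is deliberately ignored when computing the grant.
allocations = {name: formula_grant(n, RATE) for name, n, _gain in districts}

for name, n, gain in districts:
    print(f"{name}: {n} poor children, gain {gain:+.1f}, grant ${allocations[name]:,}")
```

Districts A and B receive identical grants despite different results, while District C receives less than District A despite better results. Under such a rule evaluation findings have no channel through which to alter the flow of funds.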

It is, therefore, difficult to conclude that improving schools’ production 
of poor children’s achievement was the legislation’s major purpose. The 
legislative intent embraced many other elements: improving educational 
services in school districts with many poor children, providing fiscal relief 
for the central cities and parochial schools, reducing discontent and conflict
about race and poverty, and establishing the principle of federal
responsibility for local school problems. The fact that these were embodied 
in a single piece of legislation contributed heavily to its passage, but it 
also meant that the resulting program was not single-purpose or homo- 
geneous. If any supposition is in order, it is precisely the opposite. Title 
I is typical of reform legislation in a large and diverse society with a 
federal political system: it reflected various interests, decentralized power, 
and for these reasons a variety of programmatic and political priorities.


Additional References: Bateman, 1969; Campbell, 1969; Campbell and 
Stanley, 1966; Dyer, in press; Evans, 1969; Hyman and Wright, 1966; 
Marris and Rein, 1967; McDill, McDill and Sprehe, 1969; Rivlin, 1969;
Rivlin and Wholey, 1967; Rothenberg, 1969; Swartz, 1961; Weiss and Rein,
1969; Wholey, 1969 a and b; Williams and Evans, 1969. 


Consequences for Evaluation 


Misconceptions about program aims result in omissions in evaluations
and in distortions of the relationship among various aspects of evaluation.
The first problem is mainly confined to program delivery. For Title I,
for example, there are several criteria of program success which appear
never to have been scrutinized. One involves impact of Title I on the fiscal 
position of the parochial schools: as nearly as I can tell, this purpose of 
the Act has never been explored.* Another involves the impact of Title 
I upon the fiscal situation of the central cities and their position vis-a-vis 
adjacent districts. Although the redistributive intent of the title was clear, 
there is little evidence of much effort to find out whether it has had 
this effect. With the exception of one internal Office of Education paper— 
which showed that Title I had reduced the per pupil expenditure disparity 
between eleven central cities and their suburbs by about half—this subject 
appears to have received no attention.** A third involves the quality of 
education in target as compared with non-target schools. Thus far no 
data have been collected which would permit an assessment of Title I’s
effectiveness in reducing intra-district school resource disparities, although 
this was one of the most patent purposes of the Act. There has been an 
extended effort, covering 465 school districts (Project 465) to gather 
information on resource delivery to Title I target schools. This might turn 
up interesting data on differences in Title I services among schools, 
districts, or regions, but comparison with schools which do not receive 
Title I aid is not provided. Since Title I seeks to provide better-than-equal 
education for the disadvantaged, measuring its impact upon resource dis- 
parities between Title I and non-Title I schools within districts would 
be crucial. This is recognized in the Office of Education regulations govern- 
ing the Title, which provide that Title I funds must add to existing fiscal 
and resource equality between Title I and non-Title I schools.*** Important 
as this purpose of the legislation is, only a few federal audits have been 
conducted; for the most part, states satisfy the federal requirements simply 
by passing on data provided by the local education agencies, most of 
which are so general they are useless. 

Such things do not result simply from administrative lapses. Evidence 
on whether Title I provides better-than-equal schooling would permit a 
clear judgment on the extent to which Federal priorities were being met. 
But the legislation allocates money to jurisdictions on a strict formula, and 
it delegates the responsibility for monitoring performance to those same 
jurisdictions: this reflects both the decentralization of power in the national 
school system and the sense of the Congress that it should remain just so. 


— 

*The most recent National Advisory Council on the Education of Disadvantaged
Children report has a brief section on this issue. It does not deal
with program impact, but with private-parochial school relations.

**Jackson, P. B. (1969). Hartman (undated) concluded that the aims of Title I are so
[illegible] as to make the act little more than a general (i.e., non-categorical) vehicle for
distributing educational revenues.

***The requirement is found in U. S. Dept. of Health, Education, and Welfare (1968).
There also is a special memorandum (Howe, 1968) covering this issue.

An important source of inadequate program delivery studies is inadequate
Federal power or will to impose its priorities on states and localities; the 
priorities are enunciated in the statute, but the responsibility for determin- 
ing whether they are being met is left with the states and localities. 

The distribution of power is not the sole source of such problems in 
evaluating program delivery; the sheer size and heterogeneity of the society, 
and the unfamiliarity of the problems are also important. Project Headstart 
is illustrative. This program was not initially established within the existing 
framework of education. It existed mostly outside the system of public 
schools; its clients were below the age of compulsory education, and its 
local operating agencies often were independent of the official school 
agencies. Since the program came into existence, several million dollars 
have been spent on evaluation, and not a little of it on studies of program 
delivery. Yet it is still impossible to obtain systematic information on this 
subject. Several annual national evaluations and U. S. Census studies of 
program delivery are unpublished. But even if these studies had all been 
long since committed to print, they would only allow comparisons within 
the Headstart program. They would provide no basis for comparing how 
the services delivered to children under this program compare with those 
available to more advantaged children. That is no easy question to answer, 
but it is hardly trivial: without an estimate of this program’s efficacy in 
delivering services to children, its efficacy as an anti-poverty program could 
hardly be evaluated. 

There have been some recent efforts to remedy the relative absence of
information on Title I program delivery, through an extensive management
information program under development in 21 states. Data are to be
collected from a sample of schools which receive Federal aid under several
programs; it is estimated that the universe of schools and districts from
which the sample will be drawn includes roughly 90% of all public school
students in those states. Extensive information on teachers, on district and
school attributes, and funding will be collected from self-administered
questionnaires. Principals and teachers will provide information on school
and classroom characteristics and programs, including compensatory efforts.
In elementary schools the teachers will provide information on student
background, but in secondary schools these data may be taken from the
students themselves. Some effort also will be made to measure the extent
of individual students’ exposure to programs. In addition, common testing
(using the same instruments in all schools) is planned, beginning with
grades four and eleven. If this ambitious effort becomes operational in
anything approaching the time planned, in a few years extensive data will
be available with which to assess program delivery for Title I.

In short, then, an underlying purpose of social action programs
is to deliver more resources to the poor, whether they are districts, schools, 
children, or states. It is therefore essential to know how much more and
for whom. It is important both because citizens should know the extent to 
which official intentions have been realized, and because without much 
knowledge on that score, it is hard to decide what more should be done. 
Satisfying these evaluative needs implies measurement that is both historical 
(keyed to the target population before the program began) and comparative 
(keyed to the non-target population).* 
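
The two kinds of measurement named here can be made concrete with a short Python sketch. All figures are hypothetical per-pupil expenditures chosen for illustration, and the function and its names are assumptions of this sketch, not part of any Office of Education procedure. The historical measure compares target schools with themselves before the program; the comparative measure asks how the gap between target and non-target schools has changed.

```python
def delivery_summary(target_before, target_after, other_before, other_after):
    """Summarize resource delivery historically and comparatively."""
    gap_before = other_before - target_before
    gap_after = other_after - target_after
    return {
        # Historical: change in target schools relative to their own past.
        "historical_gain": target_after - target_before,
        # Comparative: target schools relative to non-target schools.
        "gap_before": gap_before,
        "gap_after": gap_after,
        "gap_closed": gap_before - gap_after,
    }


# Hypothetical per-pupil expenditures in dollars.
summary = delivery_summary(
    target_before=400, target_after=520,
    other_before=500, other_after=560,
)

for key, value in summary.items():
    print(f"{key}: {value}")
```

In this invented case the target schools gain 120 dollars per pupil, yet a gap of 40 dollars remains because spending in non-target schools also rose. The historical figure alone would overstate how far the disparity had been reduced.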

Studies of program delivery also serve a building-block function with 
respect to evaluating program outcomes. Whatever criterion of program 
effect one might imagine, it could not intelligibly be evaluated in the 
absence of data which describe the character of the program. Improved 
health, for example, is a possible outcome of the health care components of 
Title I and Headstart: one could not usefully collect evidence on changes in 
students’ health without evidence on the character and intensity of the 
care they received from the program. And if one were interested in the 
impact of health care on school performance, it would be necessary to add 
some measure of students’ achievement, classroom behavior, or attitudes. In 
the case of achievement outcomes, of course, evaluators commonly try to 
associate information about the type and intensity of academic programs 
with students’ scores on some later test. 

Despite the logical simplicity of these relationships, it is not easy to find 
large-scale social action programs in which outcome evaluation is linked to 
appropriate program delivery data. Without any direct evidence on program 
delivery, the only “input” which can be evaluated is inclusion in a program. 
But an acquaintance with national social action ventures leads quickly to 
the conclusion that an important aspect of such endeavors is the “non- 
treatment project.” There is no reason to believe that mere inclusion 
necessarily leads to change either in the substance of education or in the 


*Not all the purposes of social action programs are so neat or abstract, nor can they
all be evaluated by counting dollars, teachers, or special programs. One of the aims of
large-scale social action programs is to produce peace, or at least to reduce conflict.
Whether or not they serve these ends is well worth investigating.
Similarly, little is known about the ways in which educational institutions change.
This has been highlighted by the ability of many big-city school systems to absorb
large amounts of activity and money designed to change them, and emerge apparently
unchanged. If any question about the efficacy of social action programs is crucial, it is
how such efforts at change succeed or fail. The requirement here may not be quantitative
research, but political and social analysis, which follows the political and
administrative history of social change programs. It may be possible to learn as much
about the sources of programs’ success from studying the politics of their intent and
execution as from analyzing the quantitative relationships between program components
and some summary measure of target group performance. Although such studies would
be inapplicable in traditional educational evaluation, they are crucial for
social action programs. These programs represent an effort to rearrange
relationships, and the sources of variation in their success are therefore bound to have
as much to do with political and administrative matters as with how efficiently
program inputs are translated into outcomes.

level of resources. This phenomenon takes many forms: it may consist of
teachers or specialists who never see the target children; it may involve 
supplies and materials never unpacked, or educational goods and services 
which reach students other than those for whom they were intended; in 
still other cases it may consist of using program monies to pay for goods 
and services already in use. Whatever the specific form the non-treatment 
project takes, however, recognizing it requires extensive program delivery 
data. In a decentralized educational system the probability of such occur- 
rences must be fairly high, and the obstacles to discovering them are 
considerable. 


Even if the non-treatment project problem could be ignored, inadequate 
evidence on program delivery has other consequences for the evaluation 
of program outcomes. One of them is illustrated by the following excerpt
from one Office of Education study of pre- and post-test scores in 33 big-city 
Title I programs (Piccariello, undated, p. 4): 


For the total 189 observations [each observation was one classroom 
in a Title I program], there were 108 significant changes (exceed 
2 s.e.). Of these 58 were gains and 50 were losses. In 81 cases the
change did not appear to be significant. 


As the data in Appendix D show, success and failure seem to be 
random outcomes, determined neither clearly nor consistently by 
the factors of program design, city or state, area or grade level. 


When one reads Appendix D, however, he finds that the categorization 
by program design rests exclusively on one-paragraph program descriptions 
of the sort often furnished by the local project directors in grant applications. 
This makes it difficult to grasp the meaning of the study’s conclusions. 
Perhaps success and failure were random with respect to program content, 
but given the evidence at hand it is just as sensible to argue that program 
content is unrelated to project descriptions, and that some underlying 
pattern of causation exists.* Without evidence on program delivery, it is not 
easy to see what can be learned from evaluations of this sort. 
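
The counts in the excerpt can be checked with a little arithmetic. The sketch below is illustrative only. It assumes that a change "exceeding 2 s.e." corresponds roughly to a two-sided test at the 5 percent level and that classrooms can be treated as independent observations; the study states neither. The function names are hypothetical.

```python
from math import comb

# Counts quoted from the Office of Education study (Piccariello, undated, p. 4).
TOTAL, GAINS, LOSSES = 189, 58, 50
SIGNIFICANT = GAINS + LOSSES            # classrooms whose change exceeded 2 s.e.
NOT_SIGNIFICANT = TOTAL - SIGNIFICANT   # should match the 81 cases in the quote


def expected_by_chance(n, alpha=0.05):
    """Significant changes expected if no classroom truly changed.

    ASSUMPTION: 'exceeding 2 s.e.' is treated as a two-sided 5 percent test.
    """
    return n * alpha


def sign_test_two_sided(k, n):
    """Exact two-sided binomial test of k 'gains' out of n against p = 0.5."""
    k = max(k, n - k)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


share_significant = SIGNIFICANT / TOTAL
p_gain_vs_loss = sign_test_two_sided(GAINS, SIGNIFICANT)

print("not significant:", NOT_SIGNIFICANT)
print("share significant:", round(share_significant, 3))
print("expected by chance:", expected_by_chance(TOTAL))
print("gains vs. losses, p =", round(p_gain_vs_loss, 2))
```

Under these assumptions, far more classrooms cross the 2 s.e. threshold than chance alone would produce, yet gains and losses are split almost evenly. That pattern is compatible with real but offsetting effects as well as with a regression effect, which is why the counts by themselves settle little.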

There is, however, an important counter-argument on this point. The 
recent Westinghouse evaluation of Project Headstart, for example, took as 
its chief independent variable inclusion in Headstart projects. The premise 
for this was that the government has a legitimate interest in determining 
whether a program produces the expected results. On this view, arguments 
about program delivery are irrelevant, since from the sponsoring agency’s 


*Actually, the evaluation found that gains were more common among classrooms which 
had low [pretest scores, and] that losses were more common among those classrooms 
[with high pretest scores]; the most economical hypothesis, then, is a regression 
effect (Piccariello, undated). 

perspective, inclusion in the program is of overriding interest. (See Evans, 
1969.) 

There certainly is no question that in principle over-all program 
evaluation is justified. But the principle need not lead to a single summary 
evaluation in practice. Judgments about a program’s over-all impact can 
just as well be derived from an evaluation which distinguishes program 
types or differentiates program delivery as from one which ignores them. 
From this perspective, rather than reporting whether the “average” Head- 
start project raised achievement, it would be more meaningful to identify 
the several program types and determine whether each improved achieve- 
ment. 


purpose is not to improve achievement. But if one reads any compilation 
of project aims in Title I, he finds that only a minority aim only to improve 
achievement. Another minority aims to improve something else, and a 
majority aim to do both, or more. Given this heterogeneity of aims and the 
non-treatment project problem, one could not proceed on the basis of 
project descriptions—the stated purpose of improving achievement would 
have to be validated by looking at programs. The second step would be to 
distinguish the main approaches (the program types), from all those 
which actually sought to raise achievement. The main purpose of the 
evaluation is to distinguish the relative effectiveness of several approaches 
to this goal. 

But if the logic seems clear, the procedure does not. To empirically 
distinguish the class of projects aimed at raising achievement one must 
first know what it is about schooling that affects achievement. Only on the 
basis of such information would it be possible to sort out those projects 
whose execution was consistent with their aims from those which were not. 
But when the new programs were established very little was known on 
this point: prior compensatory education efforts were few, far between, and 
mostly failures. The legislation was not the fruit of systematic 
experimentation and program development, but the expression of a paroxysm of concern. 
Although a good deal has been learned in the last four or five years, 

[...] of techniques known to 
improve school achievement. The only way one can tell if a project is of 
the sort which improves achievement, then, is not to inspect the treatment, 
but to inspect the results. 

This creates an awkward situation. If there is no empirical typology of 
compensatory or remedial programs, what basis is there for distinguishing 
among programs? What basis is there for deciding which program 
characteristics to measure? If one does not know what improves achievement, how 
does one select the program attributes to measure? Some choice is essential, 
for evaluations cannot measure everything. 

These questions focus attention on one important attribute of the 
new programs. To the extent that they seek to affect some outcome of 
schooling, such as attitudes or achievement, they represent a sort of 
muddling-through—an attempt at research and development on a national 
scale. This is not a comment on the legislative intent, but simply a 
description of existing knowledge. If program managers and evaluators 
do not know what strategies will affect school outcomes, it is not sensible 
to carry out over-all, one-shot evaluations of entire national programs: 
the results of strategies which improved achievement might be canceled 
out by the effects of those which did not. If the point is to find what 
“works,” the emphasis should be on defining distinct strategies, trying 
them out, and evaluating the results. The highest priority should be 
maximum definition and differentiation among particular approaches. 
Program managers and evaluators must therefore devise educational treat- 
ments based on relatively little prior research and experience, carry them 
out under natural conditions, evaluate the results, and compare them with 
those from other similarly developed programs. Insofar as school outcomes 
are the object of evaluation, the work must take place in the context of 
program development and comparative evaluation. This requirement raises 
a host of new problems related to the intentional manipulation of school 
programs and organization within the American polity. 
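
The cancellation argument above can be made concrete with invented numbers. The sketch below is a minimal illustration, not a description of any actual program: the strategy labels and gain scores are hypothetical, chosen so that one strategy helps, one harms, and one does nothing.

```python
from statistics import mean

# HYPOTHETICAL gain scores (test-score points) for projects grouped by strategy.
# Both the labels and the numbers are invented for illustration.
gains_by_strategy = {
    "strategy_a": [4.0, 5.0, 6.0, 5.0],
    "strategy_b": [-5.0, -4.0, -6.0, -5.0],
    "strategy_c": [0.0, 1.0, -1.0, 0.0],
}


def pooled_mean(groups):
    """One-shot, over-all summary: average every project together."""
    return mean(g for gains in groups.values() for g in gains)


def mean_by_strategy(groups):
    """Differentiated summary: one average per strategy."""
    return {name: mean(gains) for name, gains in groups.items()}


overall = pooled_mean(gains_by_strategy)
per_strategy = mean_by_strategy(gains_by_strategy)

print("over-all mean gain:", overall)
for name, value in per_strategy.items():
    print(name, value)
```

With these made-up figures the over-all mean gain is zero even though two of the three strategies have clear and opposite effects. A single summary evaluation would report "no effect", while an evaluation differentiated by strategy would not.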


Experimental Approaches 


The problem, then, is not only to identify what the programs deliver, 
but also to systematically experiment with strategies for affecting school 
outcomes. This idea has been growing in the Federal bureaucracy as 
experience with the social action programs reveals that the system of 
natural experiments (every local project does what it likes on the theory 
that good results would arise, be identified, and disseminated) has not 
worked. The movement toward experimentation presumes that the most 
efficient way to proceed is systematic trial and discard, discovering and 
replicating effective strategies. 

Under what conditions might social action programs assume a partly 
experimental character? For Title I this would not be easy, because the 
legislation did not envisage it. It is a major operating program, and several 
of its purposes have nothing to do with achievement. Activities in Wash- 
ington designed to carry out systematic research and development would 
generate considerable opposition among recipient state and local educational 
agencies, and in the Congress. Experimentation requires a good deal of 
bureaucratic and political control, and there is little evidence of that. The 
Office of Education, for example, does not require that the same tests be 
used in all Title I projects—indeed, it does not require that any tests 
be used. The Title I program’s managers have neither the power nor the 
inclination to assign educational strategies to local educational agencies. 
Even if they did, the legislation would be at cross-purposes with such 
efforts. It aims to improve resource delivery—to ease the fiscal hardships 
of city and parochial schools, and to equalize educational resource 
disparities. Although the formula grant system is quite consistent with these 
aims, it is not consistent with experimentation. The two aims imply different 
administrative arrangements, reporting systems, and patterns of Federal- 
state-local relations. The experimental approach requires a degree of 
control over school program which seems incompatible with the other pur- 
poses of Title I. 

The question is whether other programs offer a better prospect for 
experimentation in compensatory education. In mid-1968, the White House 
Task Force on Child Development recommended that Federal education 
programs adopt a policy of “planned variation”; the Task Force report 
argued that no learning from efforts to improve education was occurring 
with existing programs, and that it would result only from systematic 
efforts to try out different strategies under a variety of school and com- 
munity conditions (White House Task Force, 1968). 

The Task Force report focused its attention on Project Follow-Through. 
Follow-Through was originally intended to extend Headstart services from 
preschool to the primary grades, but severe first year budgetary constraints 
had greatly reduced its scope. Largely for this reason, the program seemed 
a natural candidate for experimentation. The Task Force (1968) recom- 
mended that Federal officials select a variety of educational strategies and 
develop evaluation plans using common measures of school outcomes in all 
cases: 


The administration should explicitly provide budget and personnel 
allowances for a Follow-Through staff to stimulate and develop 
projects consistent with these plans . . . 


The Office of Education should select all new Follow-Through 
projects in accordance with these plans for major variation and 
evaluation. 


After three years, can the effort be termed a success? Since the program 
is still under development it would be unwise to deal with the strategies or 
their impact upon achievement. My concern is only with the quality of the 
evaluation scheme and the discussion is meant to be illustrative; the 
evaluation design may change, but the underlying problems are not likely to 
evaporate. 

The Follow-Through program of experimentation is designed to deter- 
mine which educational strategies improve achievement over what might 
otherwise be expected, and what the relative efficiency of the strategies 
is. The program began with little knowledge about the determinants of 
academic achievement, and as a consequence, equating schools and programs 
becomes much more difficult. Assume, for example, that all the Follow- 
Through projects sought to change student achievement by changing 
teachers’ classroom behavior, but no two projects used the same treatment 
or attacked the same dimension of behavior. Suppose further that in half 
the projects achievement gains for students resulted. How could one be sure 
that the gains derived from the Follow-Through strategies, and not from 
selection or other teacher attributes than those manipulated by the program? 
The obvious answer is to measure teacher attributes and use the data to 
“control” the differences. But, since the program begins with little knowledge 
of what it is about teachers and teaching that affects achievement, 
evaluators must either measure all the teacher attributes which might 
affect achievement or closely approximate an experimental design. The 
first alternative is logically impossible, for the phenomena are literally 
unknown. The second alternative poses no logical problems; it requires 
only that the Federal experimenter have extensive control over the assign- 
ment of subjects (schools, school systems, and teachers) to treatment. The 
problems it raises are administrative and political. 


The Follow-Through program has not been able to surmount them. 
Neither the districts nor the schools appear to have been selected in a 
manner consistent with experimental design. The districts were nominated 
by state officials; those nominated could accept or decline, and those who 
accepted could pretty much choose the strategy they desired from several 
alternatives. There was, then, room for self-selection. In addition, the 
purveyors of the strategies—the consultants who conceived, designed, and 
implemented or trained others to implement the strategies—seem also to 
have been recruited exclusively by self-selection. The usual ways of dealing 
with selection problems (never entirely satisfactory) seem even less helpful 
here. Although experimental and non-experimental schools could be com- 
pared to see if they differed in any important respects, the relevance of 
this procedure is unclear when little is known about what those “important 
respects” (vis-à-vis improving education for disadvantaged children) happen 
to be. 


The weight of the evaluation strategy seems to fall on comparison or 
control groups. The present plan calls for selecting a sample of treatment 
and control classrooms, carrying out classroom observation, measuring 
teachers’ background and attitudes, and using variables derived from these 
measurements in multivariate analyses of student achievement. Most of 
the instruments are still under development. But since there is neither a 
compulsion nor an incentive for principals in non-Follow-Through schools 
to participate as controls, how representative will the control classrooms 
be? What is more, these control or comparison groups cannot serve as 
much of a check on selectivity among the participants. Many of the com- 
parison schools are in the same districts which selected themselves into 
the program and chose particular treatments; even those that are not are 
bound to be somewhat selected, because of the voluntary character of 
participation. Even if no “significant differences” are found between experi- 
mental and control schools, this would only prove that selected experimental 
schools are not very different from selected control schools. 

There also may be some confounding of Follow-Through with related 
programs. Follow-Through operates in schools which are likely to receive 
other federal (and perhaps state and local) aid to improve education for 
disadvantaged children. Students will have the benefit of more than one 
compensatory program, either directly or through generally improved 
services and program in their school. There is little evidence at the moment 
of any effort to deal with this potential source of confounding. 

In addition to selection, there are problems related to sample size. 
The design assumes that classrooms are the unit of analysis; this is 
appropriate, since they are the unit of treatment. Almost all measurements 
of program impact are classroom aggregates—i.e., they measure a class- 
room’s teacher, its climate, its teaching strategy, etc. But it seems that 
relatively few classrooms will be selected from each project for evaluation 
(the 1969-70 plans call for an average of almost five per project, distributed 
over grades K-2). Since there are only a few classrooms per grade per 
project, it appears that in the larger projects there might be a dozen 
experimental units (classrooms) per grade. That is a very small number, 
especially when it is reasonable to expect some variation among classrooms 
on such things as teacher and student attributes, classroom styles, etc. In 
fact, since only six or eight of the strategies that are being tried involve 
large numbers of projects, the remaining strategies (more than half 
the total) probably will not have sufficient cases for much of an evaluation. 
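
A rough calculation shows why a dozen classrooms is a small number. The sketch below assumes that classrooms are the unit of analysis, that treatment and control groups are the same size, and that classroom means vary with a standard deviation of 1.0 (so results are expressed in classroom-level standard deviation units). These are illustrative assumptions, not figures from the Follow-Through design, and the function names are hypothetical.

```python
from math import sqrt


def se_of_difference(sd, n_treatment, n_control):
    """Standard error of the difference between two group means,
    with classrooms (not pupils) as the unit of analysis."""
    return sd * sqrt(1 / n_treatment + 1 / n_control)


def smallest_detectable_difference(sd, n_treatment, n_control, multiple=2.0):
    """Difference between group means needed to exceed `multiple` standard errors."""
    return multiple * se_of_difference(sd, n_treatment, n_control)


# ASSUMPTION: classroom means have a standard deviation of 1.0.
for n in (5, 12, 50, 200):
    needed = smallest_detectable_difference(1.0, n, n)
    print(f"{n:>3} classrooms per group: difference must exceed {needed:.2f} s.d.")
```

On these assumptions, a comparison of twelve treatment classrooms with twelve controls would register only differences larger than about 0.8 of a classroom-level standard deviation, and five per group would need well over one standard deviation. Any real variation among classrooms in teachers, pupils, or style makes the requirement harder to meet.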

There has been some effort to deal with this problem by expanding 
the student achievement testing to cover almost all classrooms in Follow- 
Through. This has not, however, been paralleled by expanded measure- 
ment of what actually is done in classrooms, a procedure which is helpful 
only on the assumption that there is no significant variation among 
classrooms within projects or strategies on variables related to the treatment 
or the effect. If this is true, of course, then measuring anything about the 
content, staff, or style of the classroom is superfluous—one need only 
designate whether or not it is an experimental or control unit. Despite 
the evaluators’ view that this approach is warranted, the inclusion of some 
classrooms for which the only independent variable is a dummy (treatment-nontreatment) 
variable seems dubious. This will inflate the case base 
and therefore produce more statistical “confidence” in the results, but it 
may so sharply reduce the non-statistical confidence that the exercise will 
be useless. The evaluators argue that given the fixed sum for evaluation, 
they cannot extend measurement of classroom content. 


Sample size problems are compounded because the evaluation is longi- 
tudinal. Since there is inter-classroom mobility in promotion (all classes 
are not passed on from teacher to teacher en bloc), following children 
for more than one year will sharply reduce the number of subjects for 
which two- or three-year treatment and effect measures can be computed. 
Add to this the rather high inter-school pupil mobility which seems to 
be characteristic of slum schools, and nightmarish anxieties about sample 
attrition result. Although nothing is certain at this point, there will be 
considerable obstacles to tracing program effects over time. 
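
The compounding of the two kinds of mobility can be sketched as follows. The retention rates are hypothetical, chosen only to show the arithmetic; no such figures appear in the Follow-Through plans. The sketch also assumes that the same rates apply independently at each year-to-year transition.

```python
def expected_retained(n_start, retention_per_transition, transitions):
    """Pupils expected to remain traceable after a number of year-to-year
    transitions, assuming the same retention rate applies at each one."""
    return n_start * retention_per_transition ** transitions


# HYPOTHETICAL rates, for illustration only.
stay_in_school = 0.75          # share of pupils who do not change schools
stay_in_measured_class = 0.80  # share promoted into a classroom that is measured
combined = stay_in_school * stay_in_measured_class

for transitions in (1, 2, 3):
    remaining = expected_retained(1000, combined, transitions)
    print(transitions, "transition(s):", round(remaining), "of 1000 pupils")
```

With these invented rates, 1,000 pupils shrink to 600 after one transition, 360 after two, and 216 after three. Losses that look tolerable in a single year leave few subjects for whom multi-year treatment and effect measures can be computed.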


A fourth problem relates to student background measures. Apparently 
these data are being collected only for a relatively small sample of families. 
The evaluators are not sure that it will be large enough to allow considera- 
tion of both project impact and family background variables at once, but 
budget constraints preclude expanding the sample. 

There are a few additional difficulties that merit mention, though 
they do not arise from the evaluation strategy, but from the nature of social 
action programs. There are reports of “leakage” of treatments from Follow- 
Through to comparison schools in some communities; there also seem to 
have been shifts in program goals in some projects, and apparently there 
has been conflict in definition of aims between the Follow-Through 
administration and some projects. In fact, there appears to be an element of 
non-comparability emerging among the strategies. Some involve very broad 
approaches, whose aims center around such things as parent involvement 
in or control of schools; others are more narrowly-defined and research- 
based strategies for improving cognitive growth. As long as traditional 
evaluation questions are asked (did treatment produce different results 
than an otherwise comparable non-treatment?), this poses no problem, but 
comparing treatments is close to the heart of Follow-Through. It is difficult 
to see how such comparative questions can be answered when the programs 
are so diverse. The general change programs, for example, appear to be 
hostile to the idea of evaluations based on achievement. They seem to be 
moving toward establishing other outcomes (“structural change” in schools, 
for example) as the program aim of primary concern. This heterogeneity 


of aims may well restrict the scope of comparative analysis for Follow-Through. 
The common element in all these difficulties is that the Office of 
Education is largely powerless to remedy them. Random assignment of 
schools to treatments and securing proper control groups are the most 
obvious cases; lack of funds to generate adequate samples of experimental 
classrooms or parents are other manifestations of the same phenomenon. 
Although there is no doubt that some problems could have been eased by 
improved management, no amount of forethought or efficiency can produce 
money or power where there is none. Nor is it easy to see how the Office 
of Education could effectively compel project sponsors not to change some 
aspects of their strategies or not to alter their notion of program aims. 

The experience thus far with Follow-Through suggests, then, that 
the serious obstacles to experimentation are political: first, power in the 
educational system is almost completely decentralized (at least from a 
national perspective), and federal experimentation must conform to this 
pattern; second, the resources allocated to eliminating educational dis- 
advantage are small when compared to other federal priorities, which 
indicates the government’s relatively low political investment in such efforts. 
Consequently, federal efforts to experiment begin with a grave deficit 
in the political and fiscal resources required to mount them, and there is 
little likelihood of much new money or more power with which to redress 
this imbalance. These difficulties are not peculiar to evaluation: they result 
from the same conditions which make it difficult to mount and operate 
effective reform programs. The barriers to evaluation are simply another 
manifestation of the obstacles to federally-initiated reform when most 
power is local and when reform is a relatively low national priority. 

Several dimensions of social action program evaluation emerge from 
this analysis. My purpose here is not to provide a final typology of evalua- 
tion activities, but simply to suggest the salient elements. First among these 
is the identification of program aims; this ordinarily will involve the 
recognition of diversity, obscurity, and conflict within programs, and greater 
attention to program delivery. Evaluators of social action programs often 
complain that the programs lack any clear and concise statement of aims, 
a condition which they deplore because it muddies up evaluations. Their 
response generally has been to bemoan the imprecision and fuzzy-minded- 
ness of the politicians and administrators who establish the programs, and 
then to choose a summary measure of program accomplishment which 
satisfies their more precise approach. I propose to stand this on its head 
and question the intellectually fuzzy single-mindedness of much educational 
evaluation. It generally has not grasped the diverse and conflicting 
nature of social action programs, and therefore produces unrealistically 
constrained views of program aims. 

The second element is clarity about the social and political framework 
of measurement. In traditional evaluation the ideal standard of comparison 
(the control group or pre-measure) is one that is just like the treatment 
group in all respects except the treatment. But in social action programs 
the really important standard of comparison is the non-treatment group— 
what one is really interested in is how much improvement the program 
produces relative to those who do not need it. As a result the evaluation 
of social action programs is essentially comparative and historical, despite 
its often quantitative character: it seeks to determine whether a target 
population has changed, relative both to the same population before the 
program began and to the non-target population. 

Finally, the evaluation of social action programs in education is 
political. Evaluation is a technique for measuring the satisfaction of public 
priorities; to evaluate a social action program is to establish an information 
system in which the main questions involve the allocation of power, status, 
and other public goods. There is conflict within the educational system 
concerning which priorities should be satisfied, and it is transmitted, willy- 
nilly, to evaluation. This puzzles and irritates many researchers; they 
regard it as extrinsic and an unnecessary bother. While this attitude is 
understandable, it is mistaken. The evaluation of social action programs 
is nothing if not an effort to measure social and political change. That is 
a difficult task under any circumstances, but it is impossible when the 
activity is not seen for what it is. 


Suggestions for an Evaluation Strategy 


What might be the elements of a more suitable evaluation strategy? 
The answer depends not only on what one thinks should be done, but on 
certain external political constraints. There is, for example, good reason 
to believe that federal education aid will be shifted into the framework 
of revenue sharing or block grants in the near future, and this is unlikely 
to strengthen the government's position as a social or educational 
experimenter. The only apparent alternative is continuing with roughly the 
present balance of power in education as categorical aid slowly increases 
the federal share of local expenditures. This seems unlikely to improve 
the government's position in the evaluation of large-scale social action 
programs. In addition, there seems to be a growing division over the 
criteria of program success. Researchers are increasingly aware that little 
evidence connects the typical criteria of program success (high achievement 
and good deportment) with their presumed adult consequences (better 
job, higher income, etc.). More important, in the cities, particularly in 
the Negro community, there is rising opposition to the view that achievement 
and good behavior are legitimate criteria of success. Instead, political 
legitimacy—in the form of parent involvement or community control— 
is advanced as a proper aim for school change programs. It is ironic that 
the recent interest in assessing schools’ efficiency, which gained much of 
its impetus from black discontent with white-dominated ghetto schools, 
now meets with rising opposition in the Negro community as blacks seek 
control over ghetto education. Nonetheless, this opposition is likely to 
increase, and the evaluation of social action programs in city schools is 
sure to be affected. 

There is, then, little reason to expect much relaxation of the political 
constraints on social action program evaluation. This suggests two principles 
which might guide future evaluation: experiment only when the substantive 
issues of policy are considerable and reasonably well-defined; reorient 
evaluation of the non-experimental operating programs to a broad system 
of measuring status and change in schools and schooling. 

The first principle requires distinctions among potential experiments 
in terms of the political constraints they imply. One would like to know, 
for example, which pre-school and primary programs increase cognitive 
growth for disadvantaged children; whether giving parents money to 
educate their children (as opposed to giving it to schools) would improve 
the children’s education; whether students’ college entrance would suffer 
if high school curriculum and attendance requirements were sharply reduced 
or eliminated; whether school decentralization would improve achieve- 
ment; or whether it would raise it as much as doubling expenditures. 

These are among the most important issues in American education, 
but they are not equally difficult when it comes to arranging experiments 
to determine the answers. In most cases, large scale experimentation would 
be impossible. Experiments with decentralization, tuition vouchers, doubling 
per-pupil expenditures, and radical changes in secondary education have 
two salient attributes in common: to have meaning they would have to 
be carried out in the existing schools, and few schools would be likely to 
oblige. If experimentation occurs on such issues it would be limited—a 
tentative exploration of new ideas involving small numbers of students and 
schools. While this is highly desirable, it is not the same thing as mounting 
an experimental social action program in education. 

This may not be the case with one issue, however—increasing cognitive 
growth for disadvantaged children. It already is the object of several social 
action programs and would not be a radical political departure. As a result 
of prior efforts, enough may be known to permit comparative experimental 
studies of different strategies for changing early intelligence. Several alterna- 
tive approaches can be identified: rigid classroom drill, parent training, 
individual tutoring at early ages, and language training. Relatively 
little is known about the processes underlying these approaches, but 
there may be enough practical experience to support systematic com- 
parative study. Since all the strategies have a common object, researchers 
probably could agree on common criterion measures. Since cognitive growth 
is widely believed to be crucial, investment in comparative studies seems 
worthwhile. But if such studies were undertaken, they should determine 
whether the treatment effect itself (higher IQ) is only a proxy for other 
things learned during the experiment, such as academic persistence or good 
behavior, and whether cognitive change produces any change on other 
measures of educational success, such as grades or years of school completed. 


In effect, the chances for success of experimental approaches to social 
action will be directly related to a program’s political independence, its 
specificity of aim, and its fiscal strength. The less it resembles the sort of 
broad-aim social action programs discussed in this paper, the more 
appropriate is an experimental approach and the more the “evaluation” looks 
like pure research. Of course, the further one moves down this continuum 
the less the program’s impact is, and the less relevant the appellation “social 
action.” Early childhood programs may be the only contemporary case in 
which the possibility of large-scale experimentation does not imply political 
triviality. 

The second principle suggested above implies that the central purpose 
of evaluating most social action programs is the broad measurement of 
change. Evaluation is a comparative and historical enterprise, which can 
best be carried out as part of a general effort to measure educational status 
and change. The aims of social action programs are diverse, and their 
purpose is to shift the position of specified target populations relative to 
the rest of the society; their evaluation cannot be accomplished by isolated 
studies of particular aims with inappropriate standards of comparison. 


Evaluating broad social action programs requires comparably broad systems 
of social measurement. 


A measurement system of this sort would be a census or system of 
social indicators of schools and schooling (not education). It would cover 
three realms: student, personnel, program, and fiscal inputs to schools; 
several outcomes of schooling, including achievement; temporal, geographic, 
political and demographic variation in both categories. If data of this sort 
were collected on a regular and recurring basis, they would serve the 
main evaluation needs for such operating programs as Title I. They would, 
for example, allow measurement of fiscal and resource delivery and of 
their variations over time, region, community, and school type. They would 
permit measurement of differences in school outcomes, as well as their 
changes over time. If the measurement of school outcomes were common 
over all schools, their variations could be associated with variations [...], 
including those of students, school resources, and 
the content and character of school programs. Finally, if the measurement 
of [outcomes] and resources were particular to individual students, many 
of these comparisons could be extended from schools to individuals. 
One advantage of such a measurement system would be its greater 
congruence with the structure and aims of large-scale multi-purpose pro- 
grams. Another is that it would be more likely to provide data which could 
be useful in governmental decision-making, which, after all, is what 
evaluation is for. Most evaluation research in programs such as Title I is 
decentralized, non-recurring, and unrelated to either program planning or 
budgeting; as a result of the first attribute it is not comparable from com- 
munity to community; as a result of the second it is not comparable from 
year to year; and as a result of the third it is politically and administratively 
irrelevant. Since the main governmental decisions about education involve 
allocating money and setting standards for goods, services, and performance, 
evaluation should provide comparable, continuing, and cumulative 
information in these areas. That would only be possible under a regular census of 
schools and schooling. 


This is not to say there would be no deficiencies; there would be 
several, all of which are pretty well given in the nature of a census. By 
definition a census measures stasis; it quantifies how things stand. If done 
well, it can reveal a good deal about the interconnection of social structure; 
if it recurs, it can throw much light on how things change. But no census 
can reveal much about change other than its patterns—probing its causes 
and dynamics requires rather a different research orientation. And no census 
can produce qualitative data, especially on such complicated organizations 
as schools. There is, however, no reason why qualitative evaluation could 
not be systematically related to a census. Such evidence is much more 
useful when it recurs, and is connected with the results of quantitative 
studies. The same is true of research on the political dynamics and conse- 
quences of social action programs. Although valuable in itself, its worth 
would be substantially increased by relating it to other evidence on the 
same program. 

The central problem, however, is experimentation. Using a census 
as the central evaluation device for large-scale multi-purpose programs 
assumes that systematic experimentation is very nearly impossible within 
the large operating programs and can best be carried on by clearly dis- 
tinguishing census from experimental functions. It would be foolish to 
ignore experimentation—it should be increased—but it would be illusory 
to try to carry it out within programs which have other purposes. A clear 
view of the importance of both activities is unlikely to emerge until they 
have been distinguished conceptually and pulled apart administratively. 

These suggestions are sketchy, and they leave some important issues 
open. Chief among them are the institutional and political arrangements 
required to mount both an effective census of schools and schooling and 
a long-term effort in experimentation.* Nonetheless, my suggestions do 
express a strategy of evaluation, something absent in most large-scale 
educational evaluation efforts. The strategy assumes that government has 
two distinct needs, which thus far have been confounded in the evaluation 
of large-scale action programs. One is to measure status and change in the 
distribution of educational resources and outcomes; the other is to explore 
the impact and effectiveness of novel approaches to schooling. If the first 
were undertaken on a regular basis the resulting time-series data would 
provide much greater insight into the actual distribution of education in 
America. It would thus build an information base for more informed 
decisions about allocation of resources, at both the state and Federal level. 
If the second effort were undertaken on a serious basis, it should be possible 
to learn more systematically from research and development. Perhaps the 
best way to distinguish this strategy from existing efforts is this: were the 
present approach to evaluating social action programs brought to perfection, 
it would not be adequate—it would not tell us what we need to know about 
the programs. 

Second, my suggestions assume that the evaluation of social action 
programs is a political enterprise. This underlies the idea of separating 
experimentation from large-scale operating programs. It also underlies the 
notion of a census of schools and schooling, which would almost compel 
attention to the proper standards of comparison and would emphasize 
the importance of change. In addition, only a broad system of measure- 
ment can capture the political variety which social action programs embody. 
Perhaps most important, measuring the impact of social change programs 
in this way is not tied to a particular program or pattern of Federal aid. 

Finally, such a strategy could be implemented within the existing 
political constraints. That is not a scientific argument, but that is the real 
point: evaluating social action programs is only secondarily a scientific 
enterprise. First and foremost it is an effort to gain politically significant 
information on the consequences of political acts. To confuse the technology 
of measurement with the real nature and broad purposes of evaluation 
will be fatal. It can only produce increasing quantities of information in 
answer to unimportant questions. 

*Creating the capacity for experimentation would involve a few major decisions. One 
probably would be to separate the activity from the Office of Education, retaining its 
connection with HEW at the Assistant Secretary level. Another would be to create 
[...] capacity for support and evaluation in the private (or quasipublic) 
sector; at the moment, this important resource is not well enough developed to bear the 
[...]. A third would be to so arrange its management that the decisions about what 
[...] fund resulted from systematic and sustained interaction between the 
[...] its scientific staffers and constituents, [...]. Without this, it would 
probably do interesting but politically unimportant work. 

Creating the capacity for a census or system of schooling indicators involves 
[...]. Here there is good reason for it to be part of USOE. The question is 
[...] would be required; though these require more detailed work 
[...], a bit of speculation is possible. There is some reason to think, 
for example, that the new 21 state management information system might be a [...] 
base from which to begin. There already is an ongoing program, it seems to have 
promise conceptually, and there seems to be a good state-Federal relationship. 


3: CURRICULUM EVALUATION 


IAN WESTBURY* 
University of Chicago 


Curriculum evaluation appeared as a topic of a 
chapter in three of the five issues of the 1969 Review of Educational Re- 
search. The emphasis on this topic is, if nothing else, disconcerting to a 
reviewer who must plow the same field again; it is also puzzling when com- 
pared with the infrequent appearances of evaluations of actual curricula or 
curricular materials in either the research or the subject journals. AERJ, for 
example, contained no papers in its last three volumes (67-69) that might 
be counted as an evaluation of a curriculum, a curricular prescription (such 
as Montessori or Headstart), or curricular materials. This is perhaps not 
surprising given the character of AERJ, but when this same finding was 
obtained after a survey of School Review, Harvard Educational Review, 
Social Education, Science Teacher, College English, College Composition 
and Communication, Research in the Teaching of English, and Theory into 
Practice, the contrast with the concerns reflected in the Review of Educa- 
tional Research is striking. These journals do have short reviews of texts 
and notes on curricular problems, but nothing equivalent to the preoccupa- 
tion of the Review with this one theme. 

This is perhaps not unusual. Evaluations exist in the files and reports 
of those who developed curricula. Yet, while these evaluations remain in 
files, the proposals and prescriptions of developers circulate freely, without 
any readily available critical scrutiny. There is a literature of curriculum 
evaluation, but it is neither publicly available in journals nor has it grown 
out of an accessible tradition of formal or informal appraisal of curricula. 
There is no “consensus of public knowledge” (Ziman, 1968, ch. 6) on the 
nature of curriculum evaluation which warrants methodological formalizations 
about its character or provides the substance of such formalizations. 

Curriculum evaluators are also experiencing difficulties in their relations 
to developers: “Evaluation of the wrong kind, at the wrong time, and for 
the wrong reasons has characterized too much of the current effort to 
appraise educational reforms. Meaningless evaluation is ruining the cutting 
edge of educational innovation” (National Advisory Council on Education 
Professions Development, undated, p. 1). Although this is not a repudiation 
of evaluation, it is a serious charge. It raises questions about what might be 
the right kind, the right time, and the right reasons for evaluation. One 
could make a case for treating these as open questions and for exploring the 
possibility of testing, by means of a review, the most basic assumptions of 
curriculum evaluation as it is preached and practiced. What is this “curriculum 
evaluation” that has no public face, no tradition of informal 
practice and, possibly, no happy relationship with the world represented by 
the “cutting edge of educational innovation”? 

*Urban S. Dahllöf, University of Göteborg, Sweden; Maurice Eash, University of 
Illinois; P. [...], Utah State University; and Joel Weiss, Ontario Institute for 
Studies in Education, served as consultants to [...] Westbury on the preparation of this 
chapter. 

The assertion that “we must evaluate our curricula” in terms of cost, 
effectiveness, content, and the like has a ring of sense and efficiency and a 
commonplace obviousness that make it impossible to believe the opposite. 
To this extent everyone supports the evaluation of curricula. Curriculum 
evaluation is, however, another thing; it is a body of techniques, methodologies, 
and principles created deliberately (and recently) to give some 
systematic form to the ways in which the assertion “we must evaluate . . .” 
can be made to work. As such, curriculum evaluation represents a program 
for action that rests, ambiguously, on a sense of what the evaluation of 
curricula could imply and on the ways in which particular stipulative 
definitions about curriculum evaluation should be understood (Scheffler, 
1960, ch. 1). 

Both of these foundations for a curriculum evaluation are of uncertain 
quality. Curriculum and evaluation are vague concepts and it is far from 
obvious what can be made clear by the juxtaposition of ambiguities. Pre- 
definitional usages and uncertainties must enter into any attempts to make 
sense of new concepts created in this way and must interfere with the 
reception of new stipulations. The consequences of these confusions can 
only be further confusion; curriculum workers, development administrators, 
teachers, evaluators, and federal and state governments all have their own 
interests and preconceptions, have all played their part in promulgating the 
assertion “we must evaluate curricula,” and can all interpret the proposition, 
when asserted as part of an ill-defined programmatic stipulation, in any way 
they wish. As each group confronts the operational implications of its 
chosen definition, communication becomes more difficult. The curriculum 
worker with a commitment to subject-matter and its development in 
curricula makes little sense of evaluation as an expression of a commitment 
to measure the effectiveness of school learning. The evaluator with his own 
preoccupations does not understand the curriculum worker's feeling for the 
unimportance and wrongheadedness of evaluation. 


Evaluation 


Taken at face value the assertion “we must evaluate . . .” embodies 
a simple idea, but that simplicity is deceptive. There is, for example, a 
basic debate about the responsibility of the evaluator to do more than 
describe. Glass (1968, pp. 4, 5) wrote: 


The current meaning of the term “evaluation” in several recent writ- 
ings and in federal legislation is that it is the gathering of empirical 
evidence for decision-making and the justification of decision-making 
policies and the values upon which they are based. Evaluation can 
contribute to the construction of a curriculum, the prediction of 
academic success, or the improvement of an existing course. But these 
are the roles it can play and not its goal. The goal of evaluation must 
be to answer questions of selection, adoption, support and worth of 
educational materials and activities. . . . In the past we have avoided 
the goal of evaluation with its inherent threat to teachers, administra- 
tors, and curriculum developers and have concentrated on one or more 
of the non-threatening roles evaluation can play. 


Despite the clarity of Glass’s position and Scriven’s (1967) assertion 
that the evaluator must judge, Stake (1967, p. 527) questioned the willing- 
ness of evaluators to do more than process judgments: 


Whether or not evaluation specialists will accept Scriven’s challenge 
remains to be seen. In any case, it is likely that judgments will become 
an increasing part of the evaluation report. Evaluators will seek out 
and record the opinions of persons of special qualification. These 
opinions, although subjective, can be very useful and can be gathered 
objectively, independent of the solicitor’s opinions. A responsibility 
for processing judgments is much more acceptable to the evaluation 
specialist than one for rendering judgments himself. 

The evaluator thus becomes a person directing an evaluation. But 
this subterfuge will not do. Evaluation, as Stake (1967, 1969), Glass (1969), 
and Larkins and Shaver (1969) have made quite clear, must entail judg- 
ment. The English philosopher Urmson (1969, p. 204) stated the issue 
succinctly in a recently republished paper on grading: “At some stage 
we must say firmly (and why not now?) that to describe is to describe, 
to grade is to grade, and to express one’s feelings is to express one’s feelings, 
and that none of these is reducible to either of the others; nor can any of 
them be reduced to, defined in terms of, anything else.” Urmson viewed 
the intention to make a judgment as the defining characteristic of appraisal 
in all its forms. It is only in terms of intention that it is possible to dis- 
tinguish otherwise similar actions. Evaluation may (and probably must) 
involve description, but description does not necessarily involve evaluation. 

REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 2 

The seeming preoccupation of part of the hortatory and theoretical 
literature of curriculum evaluation with this description-appraisal issue 
suggests a concern for the ways in which the issue might be solved. Different 
conclusions about the role of description and appraisal in evaluation 
entail different elaborations of the technology. An emphasis on description, 
however tacit, implies a concern for the means of description without a 
necessary regard for how what is described bears on appraisal. An emphasis 
on the responsibility of the evaluator for appraisal implies a concern for 
criteria, their defensibility, and the means of application of criteria to 
concrete curricular phenomena. Description would be performed using 
categories that locate aspects of the phenomena relevant to appraisal; the 
very language of description would be created, self-consciously, to serve 
acts of judgment. 


Intentionality 


Intention, the context which defines how propositions and actions 
should be understood, is the basis of distinctions between evaluation and 
description. From this viewpoint, to claim to evaluate a curriculum is to 
make a claim about one’s willingness to apply criteria to judge a curriculum 
good or bad. However, different kinds of intention can involve different 
kinds of evaluation. Thus, Scriven’s (1967) well-known distinction between 
formative and summative evaluations highlights the possibility that differ- 
ent criteria might be appropriate to differing evaluative contexts. In forma- 
tive evaluation, data are used to make judgments about what works when 
a developer is trying to make his ideas and ideals come about. Summative 
evaluation is the considered assessment of some whole. An analogy can 
illuminate and extend the distinction: a writer struggles to present what he 
wants to say by writing and re-writing each sentence and paragraph in 
the light of judgments of sense and effect; he looks at and assesses his 
final draft before deciding that he has achieved what he intended; his 
work is then evaluated by another critic who uses his own standards. These 
three evaluations of a piece of writing are performed with different in- 
tentions and, as a consequence, each invokes its own differently-appropriate 
criteria as the bases for assessment. 

It is unfortunate that Scriven’s distinctions have not been more often 
used and better elaborated. There should be, presumably, some optimal 
set of conditions facilitating the collection of useful feedback by develop- 
ment projects. Larkins and Shaver (1969), in one of the few papers 
reflecting these concerns, argued that a common confounding of formative 
and terminal evaluations in a one-shot evaluation inhibits sensible 
modification of a curriculum in its development period. Consequently 
appropriate revisions tend to be postponed past the time when they would 
be most useful, thus causing “the treatment being evaluated to differ 
substantially from the actual application of a curriculum once it is 
available to general use” (p. 4). As Larkins and Shaver pointed out, to 
employ hard-nosed design and its purposes to collect feedback is to deny 
the sheer complexity of creative development when seat-of-the-pants 
appraisal might be all that is feasible and, for an immediate purpose, all 
that is required. 

Considerations of this kind seem crucially important in the execution 
and planning of a maximally useful formative evaluation, but they are not 
touched upon in Grobman’s (1968) primer, Evaluation Activities of Cur- 
riculum Projects, the only how-to-do-it book readily available. There seems 
to be a tacit acceptance within the curriculum evaluation literature of the 
essential validity of the development and appraisal practices of recent 
projects: the rapid preparation of a draft program and accompanying texts, 
the widespread dissemination of the draft to a sample of schools, observation 
of teaching with the draft over a school year, the testing of student achieve- 
ment after the treatment, and the appraisal of the draft by “experts” (Grob- 
man, 1969; chs. 5-9; Welch and Walberg, 1968). The utility and efficacy 
of these procedures for both development and evaluation should be ques- 
tioned. They contrast, for example, with the procedure that Simon (1963, 
p. 13) attributed to a Russian project in which a development team worked 
intensively within one classroom and subsequently within a small number 
of classrooms for several years. A clinical model of this kind seems to offer 
advantages when the intention of developers is a program that has attended 
systematically and carefully to problems of feasibility and didactic soundness. 

An emphasis on the role of intentionality as one of the central themes 
within curriculum evaluation makes the issues in the description-evaluation 
debate clearer. The debate itself, the triviality of many evaluations (satirized 
by Wolf, 1969), the continuing need to exhort developers to evaluate, and 
the lack of enthusiasm on the part of evaluators to judge are perhaps 
reflections of a soft-heartedness and soft-headedness that Jackson (1968, p. 
144) claimed is a general trait of elementary teachers. If Jackson’s picture 
of elementary teachers is correct and can be extended to include other 
educational personnel, it does much to illuminate the unwillingness of 
developers and evaluators to come to terms with the consequences of 
rationally-directed judgment and appraisal. There is, moreover, no intel- 
lectual tradition reflecting a sustained concern with problems of curricular 
management. Curriculum, the scholarly field most concerned with general 
problems of development, has abdicated from a position of rational and 
deliberative leadership of schooling (Schwab, 1969) with the consequence 
that the commitment to evaluation has been left to evaluators. Evaluation 
inevitably reflects, therefore, the technical preoccupations of workers whose 
experience is in design and methodology rather than curriculum and 
curricular management. 

Additional references: Heyel (1968); Lortie (1969); Maguire (1969); 
Seiler (1965). 

Curriculum 


Curriculum evaluation only makes sense as it is applied to curricula. 
It must presuppose sets of concepts that can locate evaluation-relevant 
aspects of curricular phenomena as well as coherent arguments (curricular 
rationales) and procedures that can be appraised. Again Scriven (1967, pp. 
72-83) provided a general model for a kind of evaluation that deserves 
to be elaborated by filling his hypothesized appraisal-relevant distinctions 
with phenomenally specific terms. 

Curriculum, however, is an amorphous concept; its connotation is much 
wider than is suggested by Scriven’s concern for materials appraisal and 
treatment effects. Nevertheless, stipulations about what curriculum should 
be, rather than what it is, tend to dominate evaluation methodologies. 
Anderson’s (1969) otherwise exemplary study, for example, shows an 
assimilation of curriculum development into materials preparation, materials 
preparation into package production, and package production into pro- 
gramed instruction in the course of an implicit prescription for formative 
evaluation. Formative prescriptions appropriate for programed instruction 
do not necessarily work in the appraisal of discussion materials; it is too 
easy to assert that they should. 

Curriculum refers also to the established program of a school. This usage 
has been recognized by evaluators but it has not been taken up as the context 
of a further class of evaluative methods. Westbury (1969) and Wise (1969, 
pp. 151-53) asserted the need for an on-going monitoring of the effectiveness 
of a school system; in making their case they have sketched ways in which 
the National Assessment of the Progress of Education and curricular aspects 
of cost-benefit analysis might be incorporated into a coherent decision- 
making frame (see also Alkin, 1968; Educational Policy Research Center, 
1969). Such a concern, stemming from the needs of curricular management, 
must be developed by means of prescriptions for organizational structures 
and instrumentation, measurement strategies and rationales, and the classes 
of phenomena that such evaluation might attend to. Westbury (1969, sec- 
tion 3) briefly outlined one context for such theoretical development. 

The sins of omission within curriculum evaluation that the above argu- 
ment points to illustrate a tendency on the part of curriculum evaluators to 
run away from the problems involved in acknowledging the centrality 
of “curriculum” or schooling as the phenomenon which evaluation must 
address in its own terms. Something of the character of the flight by evalu- 
ators from these issues is reflected in the shifting advocacies of Harris (1947, 
1963) during sixteen years. 

In his first paper Harris designated the plans, resources and processes 
of a school as the elements which a comprehensive evaluation must address. 
He ordered a logical evaluation into three steps: (1) determination of the 
validity of a given set of criteria for the goals of a specific element; (2) an 
appropriate use of the criteria; and (3) planning for the mechanics of an 
appraisal. Harris (1947, p. 175) stated: “Whether or not the appraisal in 
its genesis follows this neat logical order, it should be possible to recon- 
struct the appraisal in this order and defend the kinds of solutions that 
have been arrived at for the various problems.” 

In his second paper (1963, p. 191) Harris severely narrowed the scope 
of this prescription by distinguishing appraisal from evaluation and defining 
evaluation as “the systematic attempt to gather evidence regarding student 
behavior that accompanies planned educational experiences.” This stipula- 
tion defined a usage that is commonly accepted. Thus, Wiley (1968, p. 3) 
claimed that “evaluation consists in the collection and use of information 
concerning changes in student behavior to make decisions about an educational 
program.” 

There is a justifiable unease among some evaluators about the narrow- 
ness of definitions of this kind. Forehand (in press), for example, wants 
to include under curriculum evaluation more than Harris (1963) or Wiley 
would permit him to do; he contrasts “project evaluation,” which appeals 
to student learning criteria, with “institutional evaluation,” which answers 
questions such as “Do the objectives underlying the program adequately 
reflect, in nature and scope, the institution’s goals? . . . Are the objectives 
posed by the institution satisfactorily met? . . . Does the benefit achieved 
by the program justify the expenditure of resources?” These questions, 
which many other evaluators (e.g., Stake, 1967) would also want to ask, 
are reminiscent of Harris’s (1947) initial definition of the appraisal task. 
Those who adopt this position want more than Harris’s later, restricted 
definition would allow; yet they are not able to explicate successfully, as 
the proponents of the narrower position can, what their stand implies 
for evaluation. Forehand (in press) acknowledged this gap between his 
desires and his ability to develop their implications in formal terms. As 
long as this gap remains (and it does), evaluators face the danger of 
claiming to do more and being able to do more than they can or want to do. 

The task of bridging this gap is the most important and most difficult 
theoretical problem within evaluation. Two separate, though interrelated 
analytical problems must be faced: curriculum must be conceptualized 
in such a way that it no longer carries the connotation that it is a unitary 
notion, often a treatment; evaluation must be seen in ways that permit 
the development of sets of methods and criteria so reasoned judgments, 
appropriate to all senses of curriculum, become possible. Curriculum evaluation 
theorists must attempt to formalize these criteria and methods so 
they can prescribe rules for the application of criteria to the full range 
of concrete curricular issues. 

No current theoretical prescription for curricular evaluation approaches 
these goals, although parts of the problem have been acknowledged by 
some writers. Larkins and Shaver (1969) and Glass (1968, 1969) claimed 
that a value analysis must be associated with an accurate description of 
the impact of a curriculum. Larkins and Shaver (1969, p. 6) wrote: 


Too little attention has been paid to improving, perhaps by 
formalizing, procedures related to this task. We are suggesting that 
evaluation design ought to be broadened to include more than data 
gathering and analysis design. It is not simply that procedures 
need to be established for obtaining assessment data that are more 
adequate, but that adequate evaluation design requires procedures 
for both obtaining adequate assessment data—assessment design— 
and procedures for weighing the worth of the assessment data. 
These latter procedures will hereafter be referred to as value 
analysis. 


Value analysis presumably involves the development of criteria and their 
application to particular phenomena as represented by data. 


Although Larkins and Shaver’s stage-model properly reflects the tem- 
poral ordering of an evaluation study and immediate research priorities, 
it is not a sound basis for development of a theoretical context for evaluation. 
Data must be sought before they can be considered adequate for the 
purposes of a study, and the kind of data sought depends on prior concep- 
tions about what should be investigated. Thus, the assumption of Harris 
(1963) and Wiley (1968) that evaluation should be defined in terms of 
information “concerning changes in pupil behavior” tacitly invokes student 
learning defined in some way as the end of education. Without denying 
the validity of student learning as an end of education one can look at 
a program in terms of other ends: as a representation of a subject, as a 
response to social conditions, as teachable, or as valid training. Any of these 
ends, and others, must be accommodated by a theory of curriculum evalua- 
tion since any one might be appropriate to a particular problem. 

The possibility that a curriculum might serve an education which has 
intrinsic value or is an object in its own right must also be addressed. Mann 
(1969, p. 40) drew on this possibility to suggest an intriguing, although 
not totally satisfactory, prescription for a curricular criticism that has as its 
starting-point the assumption that “the world we create for children 
through the curriculum is a real present world, a lived-in world, and a 
meaning world.” He argued that any criticism of a curriculum presupposes 
ethical and aesthetic judgments about the meaningfulness of the world 
created for children in the here-and-now. He (1969, pp. 38, 39) wrote: 
“Curriculum, it must be remembered, is a form of influence over persons, 
and disclosures of meanings in curriculum are disclosures about the character 
of an influence. . . . The meanings a curriculum critic discloses, 
then, are meanings about which he believes ethical judgments are to 
be made.” 

Mann’s model in his sketch of a curriculum criticism is literary 
criticism. This analogy is intriguing because it embraces many of the 
problems within curriculum evaluation: the need to judge, the problem 
of criteria, and the methods for the application of highly general criteria 
to particular literary objects. It seems that curriculum evaluation should 
explore the distinctions and methods developed within literary criticism 
as one way of exploring a variety of possible approaches in an analogous 
discipline to some of its theoretical problems. Olson (1965), for example, 
is explicitly concerned with the contrasts between critical traditions which 
see art as an “instrument productive of certain effects” and those which 
explore literary objects as things standing in their own right, apart from 
their influence on and independent of their readers or audiences. This 
distinction defines some of Mann’s (1969) concerns: his treatment of the 
problem shows that literary approaches have analogies in curriculum. The 
details of such analogies seem to be worth explicating. 


Examples of Curriculum Evaluation 


I attempted (Westbury, 1969) a fragmentary comparison of the reports 
of two approaches to the remediation of the characteristic disabilities of 
inner-city children, the Bereiter and Engelmann (1966) program and 
the New York First Street School (Dennison, 1968). I accepted, for the 
purposes of argument, the claims of the programs that they were successful 
but asserted that an assumption that the programs were therefore in any 
way equivalent would be absurd. 

I argued that the radical differences between the programs were the 
results of different philosophical starting-points, the basic assumptions 
about what phenomena should be attended to. Bereiter and Engelmann 
presumed that human behavior is lawful, predictable, and ordered and 
that this order makes it possible to search for rules to control instructional 
[...] 
model governed by rules. Bereiter and Engelmann assumed a clear distinction 
between the knower and the known and gave no place to feeling 
[...] 
way as a surgical technique; Dennison cannot accept this and is required 
to write a persuasive exhortation that rests on a series of vignettes and 
loosely-made suggestions to describe what school experience might be. To 
evaluate curricula formed in these ways, one must acknowledge that these 
conceptions of curricula are appropriate to higher-order principles; under- 
standing what such higher-order principles involve must constrain the 
form of an evaluation. Yet, within this constraint, it is possible to evaluate 
curricula by using criteria of consistency, coherence, and subtlety since 
such terms are appropriate to a particular embodiment of a higher-order 
structure. 

Anderson (1968) illustrated how this evaluative prescription might 
work in her critical analysis of the Bereiter and Engelmann, Headstart, 
and Montessori prescriptions for preschool curricula. To organize her 
analysis she constructed a set of questions based on terms from two sets 
of commonplaces: teacher, student, milieu, knowledge and aims, methods, 
content, organization. She subjected all three of the programs to her 
questions as a way of organizing her understanding of what each program 
was; she then compared each with the others to highlight their principles 
and entailments. Her initial questions thus became a set of categories 
for evaluation-impregnated description. Her criterion for evaluation of 
each of her descriptions was internal consistency; her contrasts among 
programs suggested the concerns that each failed to develop. 

Anderson’s arguments lack depth and subtlety, but her approach 
showed ways to the appraisal of a curriculum qua curriculum. She demon- 
strated how a program’s claims may be understood and accepted before 
some other set of terms (that may or may not respect a program’s inten- 
tions) is introduced from the outside. 

Crittenden (1970) developed these concerns fruitfully in another 
evaluation of the Bereiter and Engelmann program. He criticized four 
aspects of the theoretical development of this program: (1) an apparent 
neglect by Bereiter and Engelmann of the social concerns embedded in 
the Bernstein formulation of the role of the “elaborated” and “restricted” 
codes in the development of intelligence; (2) inadequacies in the logical 
and epistemological analysis of language and meaning which vitiate the 
language hierarchy that is basic to the curriculum; (3) the failure of the 
Bereiter and Engelmann manual to offer users a considered and responsible 
treatment of the logical bases of the language hierarchy; and (4) theoretical 
and practical fallacies stemming from Bereiter and Engelmann’s poorly 
developed view of relations among schooling, social organization, and 
social philosophy. 

Crittenden’s appraisal of the Bereiter and Engelmann program is an 
exemplary model of one kind of curriculum evaluation. Implicit in his 
critique is a demand noticed above for appraisal of curriculum qua
curriculum. He assumed that program planners attend carefully and specifically 
to the over-all rationale of prescriptions as much as (or more than) 
they attend to the claims that might be made for student learning. Any 
claims for learning or other consequences are thus contained in and stem 
from these arguments. 

Crittenden’s model for criticism is self-consciously theoretical in 
method and technique; nevertheless, and despite the theory, his analysis 
does bear in some interesting ways on problems that any hard-nosed 
administrator might be concerned with. These consequences are illustrated 
in Weikart’s (1969) initial evaluation of student gain after instruction 
under three different nursery school regimens: a unit-based curriculum 
emphasizing the social-emotional goals of the traditional nursery school, 
the cognitively-oriented Ypsilanti Perry Preschool Project, and the Bereiter 
and Engelmann program. 

Balanced treatment groups of retarded disadvantaged children were 
taught by teachers expressing a preference for one of the three curricula. 
The groups were compared after a year of half-day teaching at school, 
plus home teaching for 90 minutes every other week, in the appropriate 
curriculum style. After the treatments there were no significant differences 
among the groups on measures of intelligence and social-emotional develop- 
ment, although IQ gains were unusually large for all groups. Weikart 
(1969, p. 10) stated: “Thus, even though the results of the programs are 
the same, when children are measured on general tests, it may be assumed 
that the operation was actually different for each.” 

Weikart tentatively attributed the improvement in IQ and social- 
emotional adjustment to four curricular factors which were constant in 
all treatments: (1) the specific theoretical framework of each program 
which helped the teacher select activities, match instructional procedures 
with assumptions about outcomes, etc.; (2) the staff model which required 
intensive planning, tight organization of the programs, and commitment 
from the teachers; (3) home teaching involving the children’s mothers 
and an assumed consequence that homes were supporting the school 
experience; and (4) heavy verbal emphasis in all programs. Weikart argued 
that the constancy of results, whatever the treatment, was a product of 
a functional equivalence across the programs in these conditions. There 
may well be other ways of interpreting his results than the one Weikart 
invoked, e.g., a Pygmalion effect, but his suggestions have at least one 
interesting implication for the evaluation of curricula. 


The meaningfulness of a teacher commitment to theoretically-devised 
curricular procedures depends on the coherence and validity of the
curricular rationale or theoretical framework and on an understanding of 
that rationale by teachers. Constancy and ordered coherence in a curricular 
approach in the face of the infinite number of practical exigencies in a 
classroom would seem to depend on complete mastery by a teacher of 
the pedagogical and theoretical structure of a given program. Crittenden 
(1970) castigated Bereiter and Engelmann for weakness in the theory 
they presented to possible users in their published report and there 
is no doubt that this criticism could be extended to most other curricular 
prescriptions. 

A major and well-wrought study by Herron (1969) showed indirectly 
how a teacher’s misunderstanding of a program might be caused, might 
affect teaching, and thus might militate against the effectiveness of a 
program. Herron subjected CHEM Study chemistry, PSSC physics, and 
the Blue Version of BSCS biology to a set of separate, but interrelated 
analyses as a way to explore three evaluative questions: (1) To what 
extent is the “enquiry” objective of these programs actually embodied in 
the materials produced? (2) How do the teachers through whom the mate- 
rials filter perceive this objective and do they understand “enquiry” well 
enough to operationalize any conception of what it might mean in their 
classrooms? and (3) How does this objective compare to the explicit and 
implicit goals teachers set in their classrooms? 

The basis of Herron’s appraisal was a framework of terms describing 
scientific enquiry drawn from Schwab’s (1960) reconstruction of what 
scientists do. Herron tested this initial framework by applying it as a 
structure for the analysis of the views of scientific method commonly cited 
by developers as their sources of a view of method. This initial application 
became a test, which was passed, of the robustness of his analytic scheme. 

This scheme was then applied to two parts of the implicit curriculum 
of the science projects: (1) the text and laboratory materials and (2) the 
teachers. The results of his application were disappointing. Despite the 
claims of the developers for their materials, they were found to present 
little more than a “somewhat sophisticated” version of a “less competent” 
view of method. The teachers who had been attending workshops on the 
new materials were found to have almost no conception of what might be 
meant by a claim to teach the “nature of scientific enquiry.” 


Herron’s study is worth pondering in at least three contexts. (1) It 
raises questions about the effectiveness of the formative evaluation procedure 
of an appeal to experts that was used by the developers of these programs 
as the basis for the justification of content: How were these experts chosen 
and what criteria did they use to evaluate “content”? (2) It raises the 
irksome problem of what an appeal to learners could produce when teachers, 
the primary agents of the curriculum, do not themselves understand what 
they are supposed to be doing. As a corollary his study suggests the im- 
portance of formative teacher-curriculum and teacher-materials evaluation 
as one essential step in appraising the viability of the curricular idea. (3) 
Herron’s method illustrates one means of elaboration of the general
summative model that the consideration of Crittenden’s study suggested: the 
stated goal of the program, in this case the aim to teach the notion of 
“science as enquiry,” is treated as a claim for a program that can become 
a program-specific, yet general, criterion that defines the terms of an 
evaluation. This criterion is then systematically explicated to produce a 
set of formally-derived descriptive categories that allow a program to be 
approached through its parts. Evaluation may then take place with the 
possibility of two kinds of judgment: (1) a category is or is not treated, 
or (2) a category is adequately or inadequately developed. Clearly, this 
general summative model may equally well be used for a truly formative 
evaluation. Within a frame of this kind, evaluation and the concomitant 
description can become methodical and systematic, although not necessarily 
as rational and empirical as Scriven (1967, p. 49), for one, said they can be. 


Systematic Curriculum-Evaluation 


If Urmson’s (1969) explicitly-given direction is followed, the two 
essential components of an evaluative argument are a set of appraisal- 
related descriptive categories that permit an appropriate ordering of cur- 
ricular phenomena and a set of normative rules that make appraisal of 
the curricular rationales and practices possible. 

These preconditions for curricular evaluation are widely recognized. 
Stake (1967), for example, suggested antecedents, transactions, and out- 
comes as descriptive categories which should be attended to in evaluative 
description and congruence and contingency as criteria which should control 
an over-all appraisal. Although these categories are useful in a general 
sense, they are not close enough to curricular phenomena to be immediately 
helpful; they do not direct an evaluator precisely enough to the phenomena 
he is supposed to look at. However, the more specific and more curricular- 
relevant orderings of questions about curriculum guides suggested by 
Stevens and Morrissett (1968), Tyler and Klein (1968), Klein and Tyler 
(1969), and Payne (1969) do little more than point out the most obvious 
questions that could be asked of a curricular document. They do not lead 
a critic into the structure of a program as such structure bears on more 
complex questions of evaluation. As devices to assist hard-nosed evaluation 
these suggestions are too atheoretical and say too little to be helpful; at 
the same time they presume far too much from those who would need 
such truncated and obvious schemes. Mnemonic sequences of questions 
of the kind these schemes represent must be embedded in a more sophisti- 
cated evaluative and descriptive language than that available to the 
typical users of these schemes. 

Neither of these approaches to appraisal seems significant or helpful. 
Another way to the development of a useful language for curriculum 
evaluation would be to approach the task by means of a search of
curriculum evaluation and research literatures that did contain useful and 
defensible theoretical terms. Work of the kind represented by Herron’s 
(1969) analysis of science as enquiry or Weikart’s (1969) discussion of 
the role of teacher understanding, instructional organization and coherence, 
and parental support in achievement must precede and systematically im- 
pregnate the development of checklists of things to be attended to. 


If this point of view can be accepted, parts of curriculum evaluation 
merge with curriculum research. Thus, Eash’s (1969, pp. 19, 20) mnemonic 
listing of questions properly asks evaluation questions about the form 
of task analysis implicit in curricular materials: “Has a task analysis been 
made of the material and some relationship specified between the tasks? 
If a task analysis was made, what basis was used to organize the mate- 
rials?” The terms used to answer these questions must presume and be 
embedded within the theoretical languages that can bridge analysis of 
the nature of subjects and the character and behavior of the learner. These 
problems are, at this point, treated in the literature of psycho-technology 
(e.g., Glaser, 1965) rather than in curriculum evaluation. 

There does not seem to be any good reason for making distinctions 
between curriculum evaluation and curriculum research at the point at 
which descriptive categories are developed. Both literatures contain cate- 
gories and terms that can describe curriculum phenomena and expand the 
range of alternative ways in which any given curricular problem can be 
seen. In fact, the relationship between research and evaluation is so close 
that it would profit evaluators to search the research literature as part 
of a systematic attempt to establish a roster of potentially useful categories. 
One attempt to undertake an enquiry of this kind was reported by the 
Swedish Project Compass (Dahllöf and Lundgren, 1969; Dahllöf, 1969). 
Using data from the Stockholm study of the effects of streaming (Svensson, 
1962), Dahllöf showed that varying the amount of time spent in
instruction had an effect on student achievement. He drew from Carroll (1963) 
and Bloom (1968) in making “time in instruction and homework” an 
intervening variable in curricular process (or instructional) effectiveness. 

Evaluation studies also provide grist for the research mill. In the 
course of his appraisal of the teaching of Canadian history, Hodgetts (1968) 
found that although the subject as it is taught currently in schools is 
a fair reflection of the scholarly interests of academic historians and 
political scientists in the 1920’s and early 1930’s, it has little resemblance 
to anything that might pass as the concerns of contemporary scholars. 
He pointed to this contrast in concerns not to make the common charge 
that the school programs have failed to draw upon contemporary research, 
but rather to claim that the currently-taught courses in Canadian history 
are built around anachronistic and, therefore, truly irrelevant questions 
and issues. 




WESTBURY CURRICULUM EVALUATION 


Hodgetts’s questioning of the subject-matter that passes as the school 
subject of Canadian history raises problems in the conceptualization of 
subjects and relevance. He suggested implicitly that a school subject is a 
social institution with built-in inertia and a functionally necessary capacity 
for control and order. Texts, examinations, equivalence of credit, curriculum 
sequence, teacher training, and teacher development all work within 
parameters that must be defined by a conception of a subject as a social 
institution that is conservative and ordered (Ziman, 1968). It is this 
order which a fundamental curriculum development must face if it wants 
change and must acknowledge as its goal if it wishes reform. Hodgetts also 
suggested that relevance is a problem centering on the accessibility of a 
subject to students who cannot share the developed theoretical and intel- 
lectual preoccupations of scholars. A curriculum development must search 
for forms of congruence between student interests and those manifested by 
and contained within a subject. Ausubel (1967) invoked a parallel formu- 
lation of relevance to make a case for the curricular validity of the applied 
sciences for junior high school. Neither of these arguments made in the 
context of evaluation can draw on a tradition of concern for a similar issue 
within curriculum research to formulate its claims more adequately. 


Methodical Curriculum-Evaluation 


Categories of the kind illustrated above and their terms must be entered 
into a curricular argument or rationale. As was claimed above, the categories 
that are entered and the structure that holds them in an argument become 
the two elements on which an evaluation can focus. The general form of 
this kind of evaluation was outlined above in the discussion of Herron’s 
(1969) study. Herron used categories implicit in the projects he was con- 
cerned with. It is, of course, possible for an evaluation to be focused on 
some different set of elements within a given program than that implicit in 
its rationale; Crittenden (1970) did this in his analysis of the social
philosophy embedded in the Bereiter and Engelmann program. However, an 
evaluation using some structure or normative frame different from that 
embedded in any program is predicated on a sound understanding of the 
program being discussed. A curriculum must be understood in its own 
terms before being evaluated in other terms. Nevertheless, in both kinds 
of evaluation assumptions must be made about the forms of argument 
generally appropriate to curriculum; the generic forms of argument become, 
therefore, a consideration to be faced by a generalizing theoretical structure 
for curriculum evaluation. 

MacMillan and McClellan (1968, p. 146) claimed that the structure 
of curricular arguments is usually and properly a form of means-ends 
reasoning, in itself a morally neutral form and the “most natural way of 
describing-explaining-justifying teaching.” They suggested that the form, 
as it is applied to instructional arguments, entails two steps: first, a pro- 
cedure is described using some set of descriptive categories to locate 
phenomena, and then an end is specified either as being achieved or to be 
achieved by the procedure. Obviously means and ends (as they are con- 
tained within an argument) must be related logically, theoretically and 
empirically; conversely particular sets of claimed connections can be 
appraised for their logic, their theory and their empirical justification. 
(Insofar as ends are logical objects, which is all MacMillan and McClellan 
(1968) claim, they are not behavioral objectives; their argument is not 
therefore affected by the current objections to behavioral specification and 
the like (see Popham, 1969).) 

The number of problems that this form of argument involves, even 
as it is restricted to the end of student learning, is evident in attempts to 
specify, however loosely, the phenomenal categories needed to describe 
instruction. Weiss (1969) suggested that there are six theoretically-possible 
sets of instructional interactions that must be considered by both develop- 
mental and evaluation arguments: teacher-learner, teacher-materials, 
teacher-milieu, student-materials, student-milieu, and materials-milieu. The 
full range of interactions that such a simple manipulation of commonplace 
terms can generate when the possibility of other ends is entertained is much 
larger. 

A prescriptive theory of evaluation must, however, be taken further 
than MacMillan and McClellan needed to do for their argument. It requires 
some specification of the range of actual methods that can be invoked for 
the range of concrete evaluation problems. Design is one such sub-set of 
rules for one class of evaluative inference. Equally methodical prescriptions 
are urgently needed for handling curricular appeals to other classes of 
ends. The diversity of problems faced by evaluation, and the complexities 
of concrete situations in which evaluation can and must be conducted (and 
is appropriate) deny the possibility of a single prescriptive method. It 
seems that the method of methodical curriculum evaluation must parallel 
the eclectic methods and practical character that Schwab (1969, pp. 1, 2) 
prescribed for curriculum itself: 


There will be a renaissance of the field of curriculum, a renewed 
capacity to contribute to the quality of American education, only if the 
bulk of curriculum energies are diverted from the theoretic to the 
practical, to the quasi-practical and to the eclectic. By “eclectic” I 
mean the arts by which unsystematic, uneasy, but useable focus on 
a body of problems is effected among diverse theories, each relevant 
to the problems in a different way. By the “practical” I do not mean 
the curbstone practicality of the mediocre administrator and the man 
on the street, for whom the practical means the easily achieved familiar 
goals which can be reached by familiar means. I refer, rather, to a 
complex discipline, relatively unfamiliar to the academic and differing 
radically from the disciplines of the theoretic. It is the discipline con- 
cerned with choice and action, in contrast with the theoretic, which 
is concerned with knowledge. Its methods lead to defensible decisions, 
where the methods of the theoretic lead to warranted conclusions, and 
differ radically from the methods and competences entailed in the 
theoretic. 


No such practical and eclectic approach to curriculum evaluation has been 
reported. It is difficult to find even approaches to this idea. The best that 
can be done is to review some studies evaluating curricula in unusual ways, 
which do suggest forms within which some evaluative problems can be 
handled systematically. Such a tentative review, at best, can highlight 
problems that curriculum evaluation can treat and can underwrite the 
claim for eclecticism in method. 


Examples of Methodical Curriculum-Evaluation 


Dahllöf (1960) and Husén and Dahllöf (1960, 1965) reported an 
interesting use of activity analysis to explore the linkages between the 
mathematics and language arts curricula of Swedish junior high schools 
and a number of components of the social milieu of the curriculum. They 
found, not unexpectedly, that the subject-matter being taught could be 
categorized by its usefulness to different vocational orientations, but that 
many topics being taught could be weeded out of the program to make 
room for universally useful skills that were being poorly handled, e.g., 
applied geometry, oral communication, and reading for factual content. 

The activity analysis approach to curriculum selection is, as Dahllöf 
repeatedly stated, narrowly and dangerously utilitarian. Overused, it poses 
the threat of the same denial of the richness of the concept of education that 
all appeals to external ends must face. Yet, despite this danger and the 
conceptual simplicity of Dahllöf’s methodology, his studies do offer a form 
of analysis and appraisal that can handle claims for simple functional 
relevances that curriculum theory has barely acknowledged. Although 
rarely recognized or evaluated as such, claims for curriculum change and 
reform made by means of appeals to socially derived ends are often the 
most persuasive sources of change within educational institutions. 

Problems of the social utility of schooling are most often seen as 
concerns for educational planners, vocational trainers, manpower planners, 
and the like. The narrowness of focus within curriculum is not shared 
by members of these other professions. Anderson (1967), for example, 
treated seven “planning” problems critically in one short pamphlet, yet 
three of these topics were explicitly curricular. This planning literature, its 
methodologies and forms of critique should be seen as part of the literature 
of curriculum evaluation. The London Economist (1968) reported an 
Organization for Economic Cooperation and Development analysis of the 
effectiveness of the European attempts to divert resources to science 
education in the interests of national growth. The OECD report showed, 
through demographic and economic analysis, that this end was not being 
served. The report raised the possibility that the curricular emphasis on 
sophisticated science education for a few was one source of the slow 
diffusion of scientific research and development through industry. 


Obviously, whether for science education or vocational training, policy 
arguments of this kind have profound implications for the structure of 
the school curriculum. The evaluation of these arguments, insofar as they 
are proposals which bear on the “technical parameters of human
competency formation production functions” (UNESCO, 1968, p. 605), is a 
problem for sophisticated econometric and sociological analysis. Yet to the 
extent that education makes claims of this kind with all their public policy 
implication (e.g., that vocational education programs must be introduced 
into a state or regional system to ameliorate unemployment), curriculum 
evaluators must be aware of both the possibility of appraising these proposals 
and the powers and limitations of the forms that do handle them. Cohen 
discussed policy evaluation problems in Chapter 2 of this issue of the Review. 


Additional references: Corazzini (1968); Ribich (1968). 


Conclusion 


In a short but pointedly satirical paper, Wolf (1969) caricatured the 
efforts at systematic appraisal that have marked education’s response to 
the federal requirement of curriculum evaluation. His description (Wolf, 
1969, p. 108) of the colloquial method is representative of his point of view. 


Social psychological research has demonstrated that decisions 
arrived at by a group will achieve greater acceptance than decisions 
arrived at by an individual. This finding is the basis of the 
colloquial method. In applying this method, one need merely 
assemble a group of people who have been associated with a 
particular program to discuss its effectiveness. After a brief dis- 
cussion, the group will usually conclude that the program has been 
indeed successful. This conclusion can then be transmitted to 
funding agencies and other school personnel. It is unlikely that 
such evaluations will be challenged since they have been arrived 
at by a group. 


After conducting this review I share Wolf’s cynicism. He did not 
capture the concern and integrity of the work discussed in this paper, 
yet he did describe, however indirectly, much of the work that was seen, 
but not cited. A path through this literature, as it looks when assembled 
under the heading “curriculum evaluation,” must be unresponsive to the 
contours of the bulk of the literature. A review that follows such a path 
must be idiosyncratic. 

Two problems stand out after this review: the issue of intention, 
the context that defines what curriculum evaluation should be about and 
for, and the issue that centers on the conceptualization of the foundations 
on which curriculum evaluation must be built. Curriculum evaluation shares 
both of these problems with curriculum; as such curriculum evaluation 
shows the other face of the bankruptcy of curriculum as a scholarly or 
practical (in Schwab’s, 1969, sense) field—there can be no curriculum 
evaluation that is not intertwined with curriculum development, and 
curriculum evaluation is an immediately important goal. However, to take 
such a need as a prescription for immediate theorizing is to misread the 
character of the theory that is required and the role of such theory in 
enquiry. Theory must inform the deliberation that is evaluation but at 
the same time it must grow from deliberation. The problem implicit in 
this assertion is mapped by the requirement that curriculum and evalua- 
tion workers find a theoretical structure that permits them to embrace 
the particular and concrete with seriousness before they attempt theoretical 
speculation of any kind. We are far from this at the moment. 


Bibliography 


Alkin, Marvin C. Towards an Evaluation Model: A Systems Approach. CSEIP Working 
Paper. Los Angeles: Univ. of Calif., Center for the Study of Evaluation of 
Instructional Programs, 1968. (Offset.) 

Anderson, C. Arnold. The Social Context of Educational Planning. Paris: UNESCO, 
International Institute for Educational Planning, 1967. 

Anderson, [given name illegible]. Analysis of Three Programs for Pre-School Disadvantaged Children. 
Thesis. Chicago: Univ. of Chicago, 1968. 

Anderson, Richard C. The Comparative Field Experiment: An Illustration from High 
School Biology. Proceedings of the 1968 Invitational Conference on Testing
Problems. Princeton, N. J.: Educational Testing Service, 1969. Pp. 3-30. 

Ausubel, David P. Crucial Psychological Issues in Objectives, Organization, and Evalua- 
tion of Curriculum Reform Projects. Psychology in the Schools 4:111-21; 1967. 

Bereiter, Carl and Engelmann, Siegfried. Teaching Disadvantaged Children in the 
Preschool. Englewood Cliffs, N. J.: Prentice-Hall, 1966. 

Bloom, Benjamin S. Learning for Mastery. Evaluation Comment 1, No. 2. Los Angeles: 
Univ. of Calif., Center for the Study of Evaluation of Instructional Programs, 
May 1968. 

Burke, Kenneth. The Philosophy of the Literary Form. New York: Vintage Books, 
1957. 


Carroll, J. B. A Model of School Learning. Teachers College Record 64:723-33; 1963. 

Crittenden, Brian. [Title partly illegible] Cultural Disadvantages: The Bereiter-Engelmann 
Preschool Program. School Review 78: 145-68; 1970. 




REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 2 


Dahllöf, Urban. Kursplaneundersökningar i Matematik och Modersmålet, 1957 Års 
Skolberedning. III. Statens Offentliga Utredningar 1960: 15. Stockholm:
Ecklesiastikdepartementet, 1960 (with English summary). 

Dahllöf, Urban S. Ability Grouping, Content Validity, and Curriculum Process Analysis. 
Reports from the Institute of Education, Univ. of Göteborg. Göteborg, Sweden: The 
Institute, 1969. (Mimeo.) 

Dahllöf, Urban S. and Lundgren, Ulf P. A Project Concerning Macro-Models for the 
Curriculum Process. A Short Presentation. Reports from the Institute of Education, 
Univ. of Göteborg. Göteborg, Sweden: The Institute, 1969. (Mimeo.) 

Dennison, George. The First Street School. New American Review, No. 3. New York: 
New American Library, 1968. Pp. 150-71. 

Eash, Maurice J. Assessing Curriculum Materials: A Preliminary Instrument. Educa- 
tional Product Report 2: 18-24; 1969. 

Educational Policy Research Center. Toward Master Social Indicators. Research
Memorandum EPRC 6747-2. Menlo Park, Calif.: Stanford Research Institute, 1969. 

Forehand, Garlie A. Functions of a Curriculum Evaluation System. Teachers College 
Record, in press. 

Gagné, Robert M. and Gephart, William J. (editors). Learning Research and School 
Subjects. Eighth Annual Phi Delta Kappa Symposium on Educational Research. 
Itasca, Ill.: F. E. Peacock Publishers, 1968. 

Glaser, Robert. Toward a Behavioral Science Base for Instructional Design. Teaching 
Machines and Programmed Learning, II. Data and Directions. (Edited by Robert 
Glaser.) Washington, D. C.: Dept. of Audiovisual Instruction, National Education 
Assoc., 1965. Pp. 771-809. 

Glass, Gene V. Comments on Professor Bloom’s Paper Entitled “Toward a Theory of 
Testing Which Includes Measurement-Evaluation-Assessment.” Paper presented to 
Symposium on Problems in the Evaluation of Instruction, Univ. of Calif., Dec. 
1967. CSEIP Occasional Report No. 11. Los Angeles: Univ. of Calif., Center for 
the Study of Evaluation of Instructional Programs, 1968. 

Glass, Gene V. The Growth of Evaluation Methodology. Research Paper No. 17. 
Boulder: Univ. of Colo., Laboratory of Educational Research, 1969. 

Grobman, Arnold B. The Changing Classroom: The Role of the Biological Sciences 
Curriculum Study. Garden City, N. Y.: Doubleday, 1969. 

Grobman, Hulda. Evaluation Activities of Curriculum Projects. AERA Monograph
Series on Curriculum Evaluation, No. 2. Chicago: Rand McNally, 1968.

> r W. a School-Pro! x Educa- 
tional Research 41; eke ieee see Saa parnak of 

Harris, Chester W. Some Issues in Evaluation. The Speech Teacher 12: 191-99; 1963.

Herron, Marshall D. The Nature of Scientific Enquiry as Seen by Selected Philosophers,
Science Teachers, and Recent Curricular Materials. Doctor's thesis. Chicago: Univ.

Hodgetts, A.B. (director). What Culture? What Heritage? A Study of Civic Education 
in Canada. Report of the National History Project, Curriculum Series, No. 5. 
Toronto: Ontario Institute for Studies in Education, 1968. 

Husén, Torsten and Dahllöf, Urban. Mathematics and Communication Skills in School
and Society. Stockholm: Industrial Council for Social and Economic Studies, 1960. 

Husén, Torsten and Dahllöf, U. Curriculum Research in Sweden 11-Mathematics and
Communication Skills in Secondary Schools. Educational Research 7: 167-73; 1965. 

Jackson, Philip W. Life in Classrooms. New York: Holt, Rinehart and Winston, 1968. 

Klein, M. Frances and Tyler, Louise. On Analyzing Curricula. Curriculum Theory Network
(Dept. of Curriculum, Ontario Institute for Studies in Education) 3: 10-25;

Larkins, A. Guy and Shaver, James P. Hard-Nosed Research and the Evaluation of Curriculum.
Paper presented to annual meeting of American Educational Research
Assoc., Feb. 1969. Logan: Utah State Univ., College of Education. (Mimeo.)

London Economist. Economist, Mar. 16, 1968. Pp. 71-76. 

MacMillan, C. J. B. and McClellan, James E. Can and Should Means-Ends Reasoning
Be Used in Teaching? Concepts of Teaching: Philosophical Essays. (Edited by C.
J. B. MacMillan and Thomas W. Nelson.) Chicago: Rand McNally, 1968. Pp. 119-50.

Mann, John S. Curriculum Criticism. Teachers College Record 71: 27-40; 1969. 

National Advisory Council on Education Professions Development. Evaluation of Edu- 
cational Nae: Washington, D. C.: U. S. Office of Education, undated. 
(Mimeo.)

Olson, Elder. Introduction. Aristotle's “Poetics” and English Literature. (Edited by
Elder Olson.) Chicago: Univ. of Chicago Press, 1965. Pp. ix-xxviii.

Payne, Arlene. The Study of Curriculum Plans. Washington, D. C.: National Educa- 
tional Assoc., 1969. 

Popham, W. James et al. Instructional Objectives. AERA Monograph Series on Curriculum
Evaluation, No. 3. Chicago: Rand McNally, 1969.

Scheffler, Israel. The Language of Education. Springfield, Ill.: Charles C. Thomas, 1960. 

Schwab, Joseph J. What Do Scientists Do? Behavioral Science 5: 1-27; 1960. 

Schwab, Joseph J. The Practical: A Language for Curriculum. School Review 78: 1-23;
1969.


Scriven, Michael. The Methodology of Evaluation. Perspectives on Curriculum Evaluation.
(Edited by Robert E. Stake.) AERA Monograph Series on Curriculum Evaluation,
No. 1. Chicago: Rand McNally, 1967. Pp.

Simon, Brian. Introduction. Educational Psychology in the U. S. S. R. (Edited by Brian
Simon and Joan Simon.) London: Routledge and Kegan Paul, 1963. Pp. 1-18.

Stake, Robert E. The Countenance of Educational Evaluation. Teachers College Record
68: 523-40; 1967.

Stake, Robert. Generalizability of Program Evaluation: The Need for Limits. Educational
Product Report 2: 39-41; 1969.

Stevens, W. William, Jr., and Morrissett, Irving. A System for Analyzing Social Science
Curricula. Curriculum Theory Network (Dept. of Curriculum, Ontario Institute
for Studies in Education) 1; 1968.

Svensson, N. E. Ability Grouping and Scholastic Achievement. Report on a Five-Year
Follow-up Study in Stockholm. Stockholm Studies in Educational Psychology, No.
5. Stockholm: Almqvist and Wiksell, 1962.

Tyler, Louise and Klein, Frances. Recommendations for Curriculum and Instructional
Materials. Curriculum Theory Network (Dept. of Curriculum, Ontario Institute
for Studies in Education)


Urmson, J. D. On Grading. Philosophical Essays on Teaching. (Edited by Bertram
Bandman and Robert S. Guttchen.) Philadelphia: Lippincott, 1969. Pp. 194-217.

Weikart, David P. Comparative Study of Three Pre-School Curricula. Paper presented
to biennial meeting of Society for Research in Child Development, Mar. 1969.
Ypsilanti, Mich.: Ypsilanti Public Schools. (Mimeo.)

Weiss, Joel. Development of Model and Procedures for Curriculum Evaluation. Toronto:
Dept. of Curriculum, Ontario Institute for Studies in Education, Jan. 1969.
(Mimeo.)


Welch, Wayne W. and Walberg, Herbert J. A Design for Curriculum Evaluation.
Science Education 52: 10-16; 1968.

Westbury, Ian. Problems in Search of Disciplined Attention, or A Discipline in Search
of Its Problems: A Discussion of Some Assumptions in a Curriculum Theory.
Paper presented to annual meeting of American Educational Research Assoc., Feb.
1969. Chicago: Univ. of Chicago. (ERIC: ED 80.

Wiley, David E. The Design and Analysis of Evaluation Studies: Comments and
Suggestions. Paper presented to Symposium on Problems in the Evaluation of
Instruction, Univ. of Calif., Dec. 1967. CSEIP Occasional Report No. 28. Los
Angeles: Univ. of Calif., Center for the Study of Evaluation of Instructional
Programs, 1968.

Wise, Arthur E. Rich Schools, Poor Schools. Chicago: Univ. of Chicago Press, 1968.




REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 2 


Wolf, Richard. A Model for Curriculum Evaluation. Psychology in the Schools 6:
107-108; 1969. 


Ziman, John. Public Knowledge. Cambridge: Cambridge Univ. Press, 1968. 


Additional References 


Corazzini, Arthur J. The Decision to Invest in Vocational Education: An Analysis
of Costs and Benefits. The Journal of Human Resources 3, Supplement: 88-120; 1968.


Heyel, Carl (editor). Handbook of Industrial Research Management. New York: Reinhold.

Lortie, Dan C. The Balance of Control and Autonomy in Elementary School Teaching. 
The Semi-Professions and Their Organization. (Edited by Amitai Etzioni.) New 
York: Free Press, 1969. Pp. 1-53. 

Maguire, Thomas O. Decisions and Curriculum Objectives: A Methodology for 
Evaluation. Alberta Journal of Educational Research 15: 17-30; 1969. 

Ribich, Thomas I. Education and Poverty. Washington, D. C.: Brookings Institution,
1968.


Seiler, Robert E. Improving the Effectiveness of Research and Development. New York: 
McGraw-Hill, 1965. 


AUTHOR 


IAN WESTBURY Address: University of Chicago, Chicago, Illinois Title: Research 
Associate; Assistant Professor Age: 30 Degrees: B.A. and Dip. Ed., University of
Melbourne; Ph.D., University of Alberta Specialization: Curriculum. 


4: VALUES, GOALS, PUBLIC 
POLICY AND EDUCATIONAL 
EVALUATION 


HAROLD BERLAK* 
Washington University 


Assassinations, riots, demonstrations, pro- 
longed and bitter political confrontations, and moralistic exhortations 
characterized the decade of the 1960’s. Was it the worst of times or a 
prelude to the best of times? If the world survives, some unborn historian 
or social critic may be better able to pass judgment. But two things are 
clear as the decade of 1970 begins: there are few areas of American life 
where attitudes, values, goals, priorities, policies, and programs are not 
being challenged, and there is increasing impatience with dispassionate 
rational discussion among contending groups. 

Education is no exception. Added to the older and continuing con- 
troversies over racial segregation, federal aid to the school, sex education, 
and public aid to parochial schools are newer disputes over community 
control and the discrepancy between rich and poor school districts. Decisions 
heretofore relatively insulated from politics are now being attacked by 
parents, political pressure groups, teachers, and minority groups. Curricu- 
lum adoption, grading policy, school reorganization, tracking, composition 
of classes, promotion and assignment of school personnel, and the selection 
of pom-pom girls are now political issues. The policy maker’s basic goals 
and values, whether openly professed, implicit, or falsely attributed, are 
being questioned. Often the charge of the challengers is that the “system”
or its leaders are pursuing goals based on underlying values which are 
improper, and some change in values, leaders, programs, or the controlling 
constituency is necessary. 

The growing politicalization of all aspects of American life coupled 
with an increasing retreat from rationalism and distrust of the intellectual 
is a direct challenge to applied social scientists and other professionals 
to demonstrate that social justice is served by rational, empirically based 
policy decisions. The emerging field of educational evaluation is in part
a product of this recent political history. If educational evaluators fail
to make a significant contribution to rational policy making, or if they
make exaggerated claims of what they can do, then there is a danger
that their models and strategies will be discounted by policy makers and
the public. What can the field of educational evaluation and individual
evaluators contribute to the resolution of the value conflicts embedded in
educational policies and the disagreements over basic goals? My purpose
in this chapter is to identify a number of epistemological, practical, and
ethical problems related to this question and to suggest possible solutions.

*I wish to acknowledge the contribution of Ann C. Berlak, Washington University,
who n Cae this chapter, and of Michael Scriven, University of
California, and Louis M. Smith, Washington University, whose ideas unsettled and
provoked mine.


Programatic and Public Policy Outcomes 


Governmental and non-governmental institutions and agencies allocate 
resources for the performance of tasks either precisely or vaguely defined; 
in other words, they establish programs. The organization of men and 
materials for the purpose of designing and producing the H-bomb is by 
this definition a program, as is the effort of a hospital to deliver in-patient 
health care to the poor, or the efforts of a school board to establish a 
vocational high school or introduce a plan for school desegregation. The 
policy makers, in deciding whether to institute or to continue a program, 
may examine broad questions or limit themselves to narrow ones. For 
example, if the school board were to evaluate the outcomes of their recently 
established vocational high school, the narrower questions might include: 
are manpower needs in the local communities being met? How well have 
the students mastered the intended occupational skills? Are the graduates 
of the school being employed in positions commensurate with their training? 
The broader questions might include: Will a vocational high school sepa- 
rated physically from the comprehensive school rigidify social class distinc- 
tions and reduce mobility? Will a dual high school system reduce communi- 
cation and increase misunderstandings between the college educated and 
the blue collar workers in the community? Sets of broader and more limited 
questions can be generated for evaluating a desegregation plan, a mathe- 
matics curriculum, or virtually any educational program. 

I will label the narrower set of questions programatic and the broader 
questions as public policy* issues. Whether a policy or program raises 
public policy issues is determined not on the basis of who initiates the 
program but on the basis of the nature of the effect it may have on indi- 
viduals and the society. Thus programs initiated by a private university 
may raise issues of public policy as readily as those initiated by public
institutions. There is a certain degree of arbitrariness in this distinction
as there is in all such distinctions; however, it may help clarify a number
of issues related to evaluation in the areas of values and goals.

*Another label for these broader questions would be political issues. However, the
term political in the American context has connotations I wish to avoid. The designation
political would be appropriate if we thought of it in the classical Greek sense:
that all questions which concern the society as a whole are political questions without
the connotation of partisanship and special interest.


Criteria for Identifying Public Policy Questions 


The failure to make a distinction between programatic outcomes and 
public policy outcomes has led to some confusion in the educational 
evaluation literature. In general, educational evaluators have confounded 
narrower programatic questions with broad public policy issues or have 
failed to recognize the latter issues. At the risk of confounding the distinc- 
tion, I offer four tentative criteria for identifying public policy issues in 
educational programs. 


1) Does the program directly or indirectly alter the power relation- 
ship between the citizen and the state? This question is meant to include 
such things as the ability of the citizens to influence policy making and 
the administration of justice. There are for example numerous curricula, 
especially in the social studies, which posit very broad goals of affecting 
the transactions between citizens and state, usually not in the immediate 
but in the distant future. Though curriculum evaluators operating within 
the Tylerian tradition are likely to be distressed by such grand objectives 
because they cannot be easily operationalized, these broad goals must 
not be discounted. It is often on the basis of such grandiose claims (see 
Chapter 1) that the public policy makers institute their programs and 
not on the basis of whether the program achieves some narrower 
programatic objectives. 


2) Does the program affect immediately or in the long run the 
status a person has and the power he can exercise within the social 
system? To alter or to attempt to alter an individual’s status or power 
with respect to any institution or group is to expand or contract his freedom 
to exercise alternatives, e.g., his ability to earn a living. There are many 
educational programs—curricula, school reorganization plans, special “op- 
portunity” programs—which accidentally or intentionally affect the oppor- 
tunities and access that an individual has within a society. 


3) Does the program have any effect which tends to increase or 
decrease political or social tensions? A program instituted by a policy maker 
may have minimal programatic intent or relevance; its primary concern 
may be political—to resolve an impasse in the legislature or a dispute among 
pressure groups (see Cohen’s discussion of Head Start in Chapter 2). 


4) Does the program effect a change in the self-concept or sense of 
self-worth of the individual? For example, a program designed to encourage 
children to read books using a form of extrinsic reinforcement (praise,
tokens, food) may achieve its goal but cause some children to develop
self-concepts as pawns. If this were the case, then the program would raise
a public policy issue. 

An objection to these criteria is that they appear to make public policy 
questions of all educational programs. The response to this argument is that 
most educational policies and programs do indeed contain latent public 
policy issues. And the nature of the American political system is such that 
any group of citizens who feels strongly enough and is able to organize 
effectively can precipitate a dispute over any program. 


The Boundary Problem 


Public policy and programatic outcomes may be intended, that is,
specifically sought; unintended and anticipated, that is, not specifically
sought but nonetheless expected; or unintended and unanticipated, that is,
neither sought nor expected. The array of outcomes may be represented
as follows:


Realms of Outcomes    Intended Outcomes    Unintended Outcomes
                                           Anticipated    Unanticipated

Public policy

Programatic


The matrix helps distinguish between the concept of “objectives” and
“goals.” Public policy—intended outcomes is what I believe most people
mean when they speak of broad “goals.” Programatic—intended outcomes 
is synonymous with the Tylerian-Bloom concept of objectives. 


Examples of the type of outcomes which may fall into each cell will 
help clarify the matrix.* Public policy—intended: a curriculum developer 
may devise a social studies curriculum program which is intended to have 
some influence on the capacity of society to preserve and foster the dignity 
of the individual (although he may hold that the influence of the curricu- 
lum is at best indirect and that other factors outside of education have 
greater influence). Public policy—unintended-anticipated: an evaluator 
may find that a program which included an examination of segregation in 
housing, when introduced into a community with a history of racial strife, 
reduced the political tensions within that community. This outcome may 
not have been specifically sought but it may have been expected. Public 
policy—unintended-unanticipated: the curriculum may include materials for
the teacher which recommend use of a “socratic” style of teaching. If used
consistently, this style may affect students’ self-concepts in unpredicted
and unintended ways. Programatic—intended: the objectives of the curriculum
may be to develop the capacity of students to analyze controversial
issues through the learning of a set of intellectual skills and the mastery of
a set of substantive concepts. Programatic—unintended-anticipated: though
not specifically sought, the curriculum developers may expect that the use
of the curriculum will encourage students to discuss controversial political
issues with parents and with peers outside the classroom. Programatic—unintended-unanticipated:
one wholly unanticipated consequence of the
program may be that the students are able to remember more historical
content than students who have used a conventional history text.

*This is a hypothetical example but it is based on the Social Studies
Public Issues Curriculum (see Oliver and Shaver).

The examples are only suggestive of the vast array of outcomes which 
could be evaluated. The diversity of outcomes raises for each evaluator what 
I shall call the boundary problem. Resolution of the boundary problem 
includes a recognition of the limitations of his competence and a judgment 
about his responsibilities. Models and strategies must be employed which
are appropriate to the multiplicity and diversity of outcomes in each of 
the cells of the matrix. Since it is unlikely that any single evaluation expert 
possesses models and strategies appropriate for every cell, the evaluator 
must determine whether his models and skills are appropriate for a given 
class of outcomes. An analogy can be made to the physician who should 
know what he is competent to do and the limitations of his knowledge and 
skills. Though the physician cannot be expert in all areas, he must have 
sufficient knowledge of the patient and of the range of potential problems 
if he is to make an intelligent referral. Similarly, the evaluator in each 
case must make a judgment whether his models and skills are applicable 
to the case or whether he must seek the skills of another evaluator. 

The boundary problem for the evaluator is complicated by the fact 
that as he enters a problem he may not at that time be able to define the 
boundaries of his responsibility. It is neither desirable nor necessary to 
conduct a comprehensive evaluation of every educational evaluation program.
Although an evaluator may have the skills necessary to deal with
one or more cells in the matrix, he may recommend, for financial, political,
or other considerations, that priorities must be set which in some cases
exclude the use of his special competence. For example, an evaluator
grounded in the Tylerian tradition may find as he becomes acquainted with 
a particular program and setting (a given school, district, state) that an 
examination of the intended—programatic outcomes while desirable should 
occupy a low priority. He may recommend that the policy makers at that 
point are better served by an examination of the unintended or intended 
public policy outcomes for which he has very limited skills. What should 
be the boundaries of the evaluator’s responsibility is a complex question,
especially in the public policy area. It raises a fundamental question of
what should be the role of the expert in a democratic state. I return briefly 
to this issue in the last paragraphs of the chapter. 

To Describe or to Judge? An expert called upon to evaluate a policy 
or program can take one of at least three stances in approaching his task.
He may suggest what sorts of data may be relevant to the policy makers 
in making their judgment, and how one should proceed in data collection, 
summarization, and interpretation. This is the descriptive position advocated 
by Stake (1967). The expert may make a judgment, that is, he may clearly 
say which program is better or worse than another, or whether the program 
in existence is a wise one. This is the position recommended by Westbury 
in Chapter Three, Glass (1969), and Scriven (1967). The third position 
is that the evaluator suggests explicit criteria but does not make a judgment. 


The position taken in this chapter is that the expert must set boundaries 
for a given evaluation task, and his determination of whether he will 
describe, recommend judgment criteria, or render a judgment depends upon 
whether he is operating primarily in the public policy or programatic 
portion of the matrix. This position is argued in the next several paragraphs. 


The Moral Issues in Public Policy 


The argument in the remainder of this paper rests on a relatively non- 
controversial assumption—that disagreements on public policy issues often 
include within them real or presumed differences in moral values.* If this 
is true, and if evaluation as Westbury contends implies “willingness to 
apply criteria and judge a curriculum (and presumably other educational 
programs) good in some way or other” (Chapter 3), then clearly the 
evaluation expert, if he includes within his boundaries the public policy
cells of the matrix, must be able to demonstrate that he has some criteria
and strategies which, if applied, will enable him to resolve moral questions. 
Even if one accepts the more limited view that evaluation should merely 
provide judgment data, he must still ask whether the types of models and 
strategies discussed by Stake in Chapter 1 are adequate for identifying and 
summarizing the kinds of data relevant to rendering moral judgments. 
Clearly an expert who sees his role as helping to resolve differences over 
educational goals must be prepared to make a contribution to the resolution
of differences in moral judgments. He must be prepared to provide
data which will facilitate discourse over moral judgments or be prepared
to demonstrate that he himself is an expert in rendering moral judgments.

*I distinguish between values and moral values in the following way. A value is a
belief or conjunction of beliefs which guides human behavior. Moral values differ in
that they are beliefs that establish ideals or standards for action. Thus vigilantism has
been called a value common to most Americans (see McCord, 1960), but it
is not a moral value. Equality, honesty, and human dignity would be classified as
moral values.

I made an effort to determine to what extent the literature in the field 
of educational evaluation provides the models and strategies appropriate for 
dealing with moral issues in educational programs. An examination was 
made of the past three issues of the Review of Educational Research on 
curriculum, the Proceedings of the Invitational Testing Conferences of 
the Educational Testing Service, the usual set of educational yearbooks, and 
a sampling of the recent writings in meta-evaluation. The most common 
and carefully developed model offered in the literature is the Tylerian or 
behavioral specification of objectives approach and its major use is for 
analyzing programatic—intended outcomes (see Stake and Denny, 1969). 
Glass (1969) examined the strengths, limitations and weaknesses of the 
Tylerian, Accreditation, Management Systems and Summative-Composite 
models (also see Alkin, 1967; Guba and Stufflebeam, 1968; Scriven, 1967). 
Stake (1967; also Chapter 1 in this issue) described a comprehensive model
for collecting judgment data. In these comprehensive models there is a 
recognition of the broader questions in educational evaluations; however, 
the strategy recommended for handling values was usually the collection 
of preferential values espoused or implied by policy makers, program devel- 
opers, or other groups. What is lacking in the literature is careful treatment 
of the epistemological problems of assessing moral issues or a discussion of 
how one might render judgments where there are differences in espoused 
or implicit moral values. 

In general, I found little to justify any confidence that the field of 
educational evaluation as an applied social science possesses the models, 
strategies, or techniques for contending with the moral component in educa- 
tional decisions. This criticism is not intended as a condemnation of the 
literature. There are a growing number of theoretically interesting articles 
in the field (witness the chapters in this issue of the Review) but they are 
mainly directed at problems, issues, and model development related to the 
programatic realm. It is noteworthy that the few articles which directly 
treat the epistemological issues of resolving differences over moral values 
were not written by scholars whose primary field is educational evaluation 
(Cohen in this Review; Scriven 1966b; and Smith 1966). 


Goodlad (1969, p. 369), commenting in a recent issue of the Review 
of Educational Research on the state of the field of curriculum, wrote: 


The authors of these chapters represent a relatively new and 
growing group of scholars—theorists and researchers in the field 
of curriculum. They are primarily (but not exclusively) concerned 
with advancing their chosen field of scholarship and only secondarily
with curricula as they exist in thousands of schools and
colleges. Consequently, it is neither surprising nor necessarily a
bad thing that what they write about has no one-to-one relation- 
ship to what students and even teachers are above their ears in 
each day. If the abstract categories of research and discourse with 
which those scholars deal bear no identifiable relationship to the 
existential phenomenon called curricula, then there is, indeed, 
cause for concern. 


Change the word curriculum to educational evaluation in the first several 
sentences and the last sentence to read “no identifiable relationship to 
moral issues in educational evaluation,” and the quote aptly describes the 
state of the field of educational evaluation. 


If what I have said about the general absence of models and strategies 
within educational evaluation for dealing with moral issues is correct, then 
I believe there are at least two directions the field can take. First, the field 
could move toward the development of such models. This is the direction 
recommended by Scriven (1966b) who argued that properly trained social 
science specialists—and presumably this would include educational evalu- 
ators—could and should be equipped to decide the moral wrongs of issues. 
Thus Scriven explicitly recommended that there may be some evaluators 
who, if properly trained, could take the stance suggested by Westbury 
and Glass, that is, rendering or recommending criteria for public policy 
judgments. However, Scriven implied that most evaluators as well as other
social scientists are not at present able to take this role. Of course, the 
question remains as to what is a properly trained applied social scientist. 
Scriven offered several examples but not any clearly rationalized and formu- 
lated answer. A second alternative for the field of educational evaluation 
(though not necessarily for every evaluator) would be to keep out of the 
realm of morals. If the field moves in this direction the models and strategies 
developed would only apply to the programatic area. If the first alternative 
is chosen, then there must be further development and elaboration of 
models appropriate for the evaluation of moral issues. In order to hasten
such development, evaluators might look to fields outside of educational
evaluation.


Moral Philosophy as a Source of Models 


Although educational evaluators have dealt very little with the nature 
of discourse in moral issues, philosophers for many years have contended 
with the epistemological issues related to ethical disagreements. (See Brandt, 
1959; Dewey, 1960; Foot, 1967; Nowell-Smith, 1954; Scriven, 1966a; Sellars 
and Hospers, 1952; Warnock, 1960.) Moral philosophy is a field with a 
long and distinguished history and when a layman like myself attempts to 
summarize positions and suggest implications he is apt to blunder into
oversimplifications. With this qualification I will proceed to do injustice
to the field. 

There are two central and related questions in moral philosophy 
which are directly relevant to the problem of establishing some criteria for 
rendering moral decisions and for devising strategies applicable to educa- 
tional public policy decisions. 1) What is the meaning of words such as 
“right,” “wrong,” “good,” “bad,” “wise,” and “unwise,” and what is the nature,
meaning, and function of statements in which these words occur? 2) Can
moral judgments be proved, justified or shown invalid? If so, how and in 
what sense? Stated differently, can moral value judgments be justified in 
an objective way?* 

There are at least three types of responses to these questions and each 
seems to suggest something different about the approach which can be 
taken by evaluators who want to make a direct contribution to resolving 
the moral issues embedded in educational policies and programs. 

The first position is that one cannot derive an “ought” from an “is.” 
The effort to use a “scientific” approach to resolve ethical issues is doomed 
to failure, and to make such an attempt is to commit the “naturalistic 
fallacy.” Searle (1967) characterized the position as follows: “no set of 
statements of fact by themselves entail any statement of value. Put in more 
contemporary terminology, no set of descriptive statements can entail an 
evaluative statement without the addition of at least one evaluative premise 
[italics added].” Though there are many versions of this position, all 
accept a dualism between the realm of science and the realm of value. 
This position on ethics is fairly common among social scientists, including 
educational evaluators, but is usually implicit. The evaluator who accepts 
this position and yet insists that he must be prepared to judge or recommend 
criteria for judging the worth of educational policies is put in an awkward 
position. If he asserts that his program judgments are objective or warranted 
in any way then he is making a contradictory claim. Of course he may 
respond that the criteria he is using for such moral judgments are not 
objective but merely represent his effort to make explicit his most basic 
moral judgments (i.e., his evaluative premises). If he makes that response, 
the policy makers who employ him may simply assert that they reject the 
expert’s evaluation of the public policy issues because they hold differing 
value assumptions and prefer to seek out experts who hold value assumptions 
consistent with their own. Thus, if the criteria for judging policy issues
are not objective, then the judgments of the expert are relative, and it is 
impossible for him to defend his judgment on rational grounds. 

A second position is that all moral differences can be resolved by
rational and scientific means (e.g., see Geiger, 1961). Scriven (1966a,
1966b), for example, argued this position and attempted to show that rational
strategies used in social and behavioral inquiry, though not identical, are
similar to those which may be used for resolving moral disputes. Scriven
argued his view persuasively, but he by no means solved all the
methodological problems in the evaluation of public policy outcomes. It may be
possible to develop models and strategies along the lines Scriven suggests,
but it will be a long time before such models and strategies are fully
developed and shown to be useful. And even if such models now existed, we
are a long way from having a cadre of evaluators capable of using them.

*These questions are adapted from Frankena (1963).

A third position on moral decisions is that some but not all aspects of
ethical disagreements can be reduced to scientifically warranted
justification (e.g., see Stevenson, 1944). Educational evaluators who accept the
third position have problems of operationalizing their ideas which are
similar to those of evaluators who accept the second position. In addition,
evaluators who accept the third position have the problem of determining
criteria for deciding what components or types of moral disagreements are
not amenable to rational discourse.

I accept the third view but I do not intend to argue its merits in this 
chapter. The solution I propose for resolving ethical issues in educational 
programs is based on the position advocated by Oliver and Shaver (1966). 
However, in its present form their solution does not apply to educational 
policy issues. My suggestion is to identify a set of core ethical values of
American education and, through a process of argumentation by analogy,
establish relative priorities among conflicting moral values for the case
(i.e., program) under consideration. For example, one basic value of
American education is “individual choice in the development of one’s
interests”; a second is “cooperation among individuals and groups.”* A
judgment on the merits of a particular educational program could require 
resolution of this value conflict. By a process of reasoning through analogy 
the policy makers can determine the degree to which they are willing to 
violate one or the other of these values; thereby they may reach some 
agreement on the worth of the program. This general position is at an 
early stage of development. Some efforts to deal with the public policy 
issues in social studies curriculum decision-making have been completed 
(see Berlak and Tom, 1967; Berlak and Shaver, 1968; Tom, 1969). 


Other Sources for Models 


Educational evaluation, as distinct from educational measurement, has
had a short history. Until recently practitioners and theorists in educational
evaluation were almost solely trained in psychometrics; they approached
educational evaluation problems with psychometric models, appropriate for
individual assessment of human characteristics, but with little concern
for the epistemological, practical, and ethical problems of relationships
of social knowledge to public policy. Glass (1969) and Stake and Denny
(1969), among others, have clearly shown that the evaluation of educational
programs, though it may include individual assessment, is not identical
with it, but the discussion of policy problems in the educational assessment
literature is scant compared to other social sciences: economics, sociology,
political science, and social psychology.

*I am not able at this point [...] values are identified [...]

In these other social sciences there is a tradition of scholarship con- 
cerned with the relationship of theoretical knowledge* in the social sciences 
to social problems. In the Journal of Social Issues, The Public Interest, 
and the Journal of Conflict Resolution, one can find papers on the broad 
array of questions and problems associated with the relationship of theoretic 
knowledge to public policy which are relevant to the emerging field of 
educational evaluation. In social psychology, men in the tradition of Allport, 
Lewin, and Lippitt continue to make contributions for understanding the 
relationships of social values and social policy (for example, see Bennis, 
Benne, and Chin, 1961). In a volume commissioned by the American
Sociological Association (Lazarsfeld, Sewell and Wilensky, 1967),
individual contributors attempted to come to terms with a large number of
issues related to the uses of sociology. It is beyond the scope of this paper
for me to undertake the formidable task of analyzing the extensive literature
and the relationships of this work to the field of educational evaluation.
Clearly, it is an arduous task which remains undone, but it could lead to
extensive reconceptualization or synthesis of existing models in educational
evaluation.

Although applied social scientists have dealt with a wide array of
issues, only recently have the academic disciplines been called upon to
contribute to the special problem of rendering evaluative judgments of
social action programs. It appears from the September 1969 issue of the
Annals of the American Academy of Political and Social Science, devoted
to evaluating the War on Poverty, that other social scientists share with
educational evaluators a similar set of uncertainties and problems, especially
in the moral realm (see Ferman, 1969; Weiss and Rein, 1969). From a
cursory reading of the literature, my judgment is that the social sciences
are also at an early stage in the building of empirically based theories and
the development of models and specific strategies which can be used for
making moral judgments. A promising approach to this problem is for
social scientists generally, including educational evaluators, to study the
existential world of decision making.

*See Schwab (1969), who distinguished between theoretic disciplines, which he
characterized as concerned with knowledge, and the practical disciplines, [...]
concerned with choice and action [...]. The relationship between the theoretic
disciplines and social problems is likely to raise some unique epistemological,
ethical, and practical problems. The practical disciplines (in which I would
include educational evaluation) may share some of the former problems, but they
also present some unique problems, a few of which have been raised in this
chapter.


Naturalistic Studies of Educational Decision Making 


The call for the naturalistic study of “the way it is” was made by 
Goodlad (1969) and by Schwab (1969). There are three types of natural- 
istic studies of what educational decision makers—from teachers to congress- 
men—do which may contribute to building models, developing strategies, 
and rendering judgments on particular programs. 


(1) There are men of practical affairs in education who from their 
own experience have become experts in making or presiding over the making 
of moral judgments. Some of these experts know how to conceptualize in
abstract language what they do and are able to identify and justify their 
criteria and strategies of moral decision making. From interviews and care- 
fully conceived observational studies of their activities, one may be able to 
develop or clarify models and strategies. Glaser and Strauss (1967) made a 
similar suggestion for the development of grounded theory in the social 
sciences. There are other men of practical affairs—teachers, superintendents, 
or congressmen—who may be operationally expert in rendering wise deci- 
sions, but who have no interest in or ability to describe and conceptualize 
their activities. Social scientists operating in the mode of anthropological
field study may be able to derive principles and criteria for moral judgments
from the study of the behavior of such men. 


(2) Studies of both expert and typical decision-making behavior in 
educational settings will produce descriptions which can lead not only to 
further refinement of models and strategies for programatic evaluation, but 
also to the development of models and criteria for moral decision-making 
(see Gouldner, 1957). Accurate accounts of how decisions are made are 
necessary if the evaluator’s recommendations are to be seen as appropriate, 
relevant, and useful by the decision maker. The expert who does not
understand how the decision makers perceive their problems may find that his
prescriptions are discounted by them.


(3) Naturalistic studies of the way the decision makers arrived at 
each particular program decision may be necessary, if the evaluator is to 
design the appropriate evaluation study for that program. The unique 
characteristics of a particular program may make some models or strategies 
inappropriate or irrelevant to the programatic or public policy outcomes of
that program. The failure to gather such data may cause the evaluator
to evaluate the wrong outcomes. For example, Cohen (Chapter 2) argued
that educational evaluators attempted to assess the Head Start program
in terms of learning outcomes. He pointed out that Congress may have
had other public policy outcomes in mind, and not the improvement of
children’s achievement. Whether Congress was wise in its intentions is
a separate question and, in part, a moral one.

Smith and Geoffrey (1967), Smith and Keith (1967), and Jackson 
(1968) showed the power of naturalistic studies in educational settings. 
Further studies of this sort may stimulate progress in the development of 
appropriate models for evaluating public policy issues in education. 

As new fields of knowledge emerge they require new language; the 
development of a new language in a practical discipline presents special 
problems. Gouldner (1957) argued that if knowledge from applied social
sciences is to be useful to men of practical affairs, the theories must contain 
elements that can be reconceptualized into lay terms. The abstract concepts 
and models of educational evaluators need to be readily understood by 
the educational decision makers and be capable of being translated into 
familiar language. The burden of making a model or set of concepts 
understood is on the social scientists and not on the laymen. Jackson (1968) 
observed that teachers habitually use simple language. There is, I believe, 
an unfortunate tendency for educational theorists and researchers in evalua- 
tion to convert this observation into a rationalization for their failure to 
conceptualize their ideas in ways which can be understood by teachers
and administrators.


The Role of the Evaluator in Public Policy: Some Conclusions 


In the immediate future, many educational evaluation specialists 
are capable of defining within their boundaries only programatic outcomes. 
The stance suggested for evaluators by Glass, Scriven and Westbury is, I 
believe, appropriate for making assessments in the programatic realm. 
However, in making assessments in the public policy realm, perhaps Stake’s 
position that evaluators should confine themselves to descriptions would 
in most cases be more reasonable, given the absence of models and 
strategies in the realm of public policy evaluation and the difficult theoret- 
ical and practical problems which are raised in making warranted moral 
decisions. In general there is no single group of professionals trained to 
act as comprehensive experts on policy matters in education. But programs 
exist and there is the need to evaluate. At the present time social scientists
from several fields, philosophers, journalists, ethical theorists, educational
evaluators, and educational decision makers can be called upon to provide
a piece of whatever special insight, concepts, or data they may have
which will contribute to making more rational the decision on a particular
program under consideration.

The distinction between public policy outcomes and programatic
outcomes, though relatively clear in general cases, may not be so clear in
concrete instances. For example, an evaluator who is attempting to under- 
stand and evaluate programatic outcomes may discover that they are diffi- 
cult to distinguish from public policy outcomes, or he may discover 
inconsistencies among the proclaimed public policy intentions, the pro- 
gramatic intentions, and the way the program has been operationalized. 
Certainly there is nothing suggested in these pages which is meant to imply 
a rigid jurisdiction for educational evaluators. I do feel, however, that the 
educational evaluator should enter into the evaluation of public policy 
and ethical issues with the caution appropriate for a man who is not sure 
that his tools and techniques are appropriate to the task. 

Conversely, the philosopher, social scientist, journalist, etc., serving 
as an expert on a public policy issue may find that it is necessary to exam- 
ine the conduct of the program in the more traditional sense in order to 
help assess the public policy issues since in many, perhaps most, cases if 
one does not know whether the clients have mastered certain intended 
programatic outcomes, the public policy issues are irrelevant. However, when 
the non-expert in programatic evaluation enters this field, he too must 
move with suitable caution because he also is not likely to have the models 
and strategies necessary for clarifying the criteria for judgment or for col- 
lecting, analyzing and summarizing the judgment data. 

Finally, policy making agencies may be fully justified in evaluating 
a program without collecting any performance data. The program could 
be rejected as morally wrong whether or not it accomplished its intended
programatic outcomes. It also may be true that programatic outcomes for
a given program show no significant changes in student learning, yet the 
program may be judged successful on public policy grounds. For example, 
the decentralization of a school system may be judged successful because 
it calmed a potentially explosive political situation within a city, yet its 
outcomes in traditional achievement terms may be negligible. 


A Hypothetical Example 


The following hypothetical example is intended to clarify some of the 
arguments made in this paper. Its purpose is also to demonstrate how 
educational evaluations might proceed in the public policy realm in the 
face of the existing epistemological and practical problems. 

Assume that a school system has introduced a Computerized Arith- 
metic Program (CAP) into the system. Also assume that CAP was developed 
elsewhere and was implemented in the school system on a small scale. A 
decision must be made on whether the resources of this school should be 
committed to implementing CAP throughout the system. CAP has the
programatic-intended outcome (or objective) of providing each child with
individualized arithmetic drills on the basic processes of addition,
subtraction, multiplication, and division; further, these skills are intended
to be learned at differing rates by virtually all children who participate in
the program. An additional programatic-intended outcome of CAP is that
teachers will have additional time for teaching the students more complex 
mathematics. The developers may also have some programatic-unintended- 
anticipated outcomes (e.g., they may expect teachers will be initially hostile 
to the use of a machine). 

In the program the child spends between 2 and 10 minutes per day 
at a computer console responding to arithmetic problems delivered by 
the computer to the child. The number and complexity of the problems
delivered to the child depend upon the number of his correct responses
at a given level.
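
The adaptive rule just described can be made concrete with a small sketch. The chapter does not give the actual CAP rules, so everything below is a hypothetical illustration: the names `DrillState`, `update`, and `problems_remaining`, the promotion threshold of three correct responses, and the five-level scale are all assumptions made only to show how the number and complexity of problems could depend on correct responses at a level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillState:
    """Where one child stands in the drill (hypothetical representation)."""
    level: int = 1             # complexity level; 1 is assumed to be the easiest
    correct_at_level: int = 0  # correct responses so far at this level


def update(state: DrillState, answered_correctly: bool,
           promote_after: int = 3, max_level: int = 5) -> DrillState:
    """Return the child's state after one response.

    Assumed rule: once the child has given `promote_after` correct
    responses at a level, the next problems come from the next level.
    An incorrect response leaves the count unchanged.
    """
    if not answered_correctly:
        return state
    correct = state.correct_at_level + 1
    if correct >= promote_after and state.level < max_level:
        return DrillState(level=state.level + 1, correct_at_level=0)
    return DrillState(level=state.level, correct_at_level=correct)


def problems_remaining(state: DrillState, promote_after: int = 3) -> int:
    """Fewest further problems needed before the child can move up a level."""
    return max(promote_after - state.correct_at_level, 0)


# One child's session: correct, incorrect, correct, correct.
state = DrillState()
for response in (True, False, True, True):
    state = update(state, response)
```

Under these assumed rules a child who answers quickly and correctly sees fewer problems at each level, which is consistent with the varying daily time at the console mentioned in the text.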

Judgment data for determining the success of the program could be 
collected in many of the areas suggested by Stake. For example, the pro- 
gramatic evaluation could entail an effort to assess the arithmetic achieve- 
ment of the students in the program and an effort to determine whether, 
as intended, the teachers spent more of their time on more conceptually 
complex mathematical instruction. A number of programatic-unintended- 
unanticipated outcomes might be discovered, for example, that teachers 
have abandoned the teaching of science since the introduction of the 
program, or that students who have completed several years of the program 
have a greater tendency to elect advanced mathematics in high school or
college, and so on. One or more evaluation specialists may collect and
interpret such data to be used by school officials in making a decision. 

Consider now the public policy realm. The program developer may
have had no public policy-intended outcomes. But suppose that the
educational evaluator asked a social psychologist or anthropologist to make
some observations in the school where CAP is being used. This expert, using
field notes as a primary source of data, finds that the children in the program
have invented a game. As they participate in the program, the children
keep careful track of the number of minutes they spend at the computer
console; the children compete with each other to see who can spend the
least amount of time at the console. The game may have been wholly
unobserved by school administrators or teachers, or seen as a harmless
diversion. The social psychologist, by training alert to some of the problems
of society which can arise from excessive competition, has uncovered a
public policy question: should the school system continue a program which
seems to reinforce competitive values? Ignoring for the moment other
programatic questions, one can see that there is here a public policy outcome
as I have defined it, though it was neither intended nor anticipated by the
developers or the school officials. Whatever may be the mathematical
achievement reached by the children, a policy maker may hold that the
reinforcement of this value is so morally reprehensible that the program should be
abandoned.

Can a discussion on this moral issue proceed further without resorting 
only to personal preferences? It is possible to approach this question ration- 
ally, although most educational evaluators are usually not prepared to do 
so alone. For example, the educational evaluator (or decision maker) might 
consult with experts from related disciplines. Social psychologists who have 
conducted studies of competitive behavior in other contexts may be able 
to contribute to the understanding of the effects of competitive behavior 
among adults in differing social settings and thereby suggest relevant 
criteria for evaluating the effects of this unintended influence on the so- 
ciety’s social system. It may be that the social psychologist is able to distin- 
guish several types of competitive behavior with variable sets of behavioral 
consequences and to demonstrate that the sort of competition reinforced 
by this program appears to be relatively inconsequential in terms of other 
social behaviors. Developmental psychologists may also be consulted to 
contribute their understanding of the effects of competitive games by 
peers on the moral development of children, and they may have something 
useful to say about variable effects at different stages of ego development. 
A clinical psychologist may be able to estimate the effect that such games
may have on the normal and disturbed child (e.g., what are the effects
on self-concept?). Political scientists who have studied political socializa-
tion may be asked to conduct a study or, on the basis of previous research, 
to suggest how characteristic patterns of child behavior affect participation 
in politics as an adult. 

The issue of the effects of competition on the child in society obviously
is complex. The problem of attempting to determine its desirability is
equally difficult. What I have tried to show by this example is that public
policy questions are to some degree answerable through the application
of social science knowledge to the issue. After the strategies recommended
in the previous paragraph have been used, a moral dispute may remain.
According to the suggestions offered earlier in this chapter, one could
attempt to identify the core educational values implied in this dispute. If
one assumes “competition” and “cooperation” are core educational values,
and both should be fostered in school, then there is an obvious value conflict.
One could employ a process of value clarification through the use of analogy
which I believe may succeed in moving the policy makers toward agreement
on priorities they believe should be assigned to these values in this particular
instance.


Policy Questions and Elites 


The question of the boundaries of responsibility of the educational
evaluator has been raised. What should be the role of an expert in a free
society is a central issue that must be considered by all educational
evaluators who attempt to influence public policy. Experts are obviously
increasingly necessary in a technologically complex industrialized urban society;
the question of their proper role is as old as debates on democracy itself.
Only a hopeless romantic would deny that experts must be consulted in
some way when formulating or evaluating policies and programs dealing
with the delivery of medical care, the use of atomic power for the
production of electricity, or the formulation of foreign policy goals and strategies,
and so on. But the expert must remain responsible to the ultimate source
of all legitimate authority in a democracy: the people. Though we cannot
survive without experts, they can also do us in. I am suspicious of
immodest experts; experts by virtue of their expertise certainly do not possess
superior moral values. Critics of democracy have long pointed out that the
hazard of a democratic system is that the people may not choose the
wisest men to govern. This unhappy consequence disturbs me less than
the expert who presumes he knows best what is good for society.


5: EVALUATION OF CLASSROOM INSTRUCTION


BARAK ROSENSHINE*

In this review, an attempt is made to describe
available instruments for the observation of classroom instruction and to
suggest modifications for local evaluation of instruction. Four potential
uses of these instruments are described and examples are given of each:
assessing variability in classroom behavior, assessing whether the teacher's
performance agrees with specified criteria, describing classroom interaction,
and determining relationships between observed classroom behavior and
outcome measures. Finally, several difficulties in the use of observational
instruments and in the interpretation of the results are noted. Major
emphasis is given to the evaluation of instruction within specific
curriculum projects, that is, programs in which the instructional materials were
developed by special groups such as the Biological Sciences Curriculum
Study. The term curriculum refers to the instructional materials and
the suggestions for their use; the term instruction or instructional program
refers to the interaction among teachers and students as the materials
are used.


The Need for Data on Instruction 


Classroom instruction is usually measured indirectly in the
evaluation of instructional programs. Data obtained from direct observation of
classroom interaction are seldom collected and analyzed. For example,
among hundreds of research and evaluation reports at the ERIC
Clearinghouse on Early Childhood Education, Katz (1969b) found only ten
observational studies reported since 1960. Of nine reports comparing
preschool programs, only one (Katz, 1969a) included data on
classroom interaction. Similarly, in a review of curriculum evaluation in science,
Welch (1969) discussed the evaluation activity in 46 science projects
(Lockard, 1968); apparently only the reports on Harvard Project Physics
contained information on classroom interaction.

*[...] Illinois, served as consultants to Dr. Rosenshine on the preparation of this chapter.
[...] to the author.

REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 2

The lack of information on classroom interaction hinders evaluation 
of a single curriculum or different curricula because without this infor- 
mation one tends to assume that all classrooms using the same curriculum 
materials constitute a homogeneous “treatment variable.” Such an assump- 
tion is questionable because teachers may vary widely in what activities 
they select and how they implement them. In studies where teacher 
behavior in special curricula was compared with the behavior of teachers 
in “traditional instruction” (to be discussed below), there often was 
significant variation in the behavior of teachers within each group. Al- 
though the number of classrooms observed in these studies is small, 
the results are consistent enough to cause serious doubts about whether 
all classrooms using the same curriculum constitute a single treatment 
variable. 

The futility of treating a curriculum as a single variable is illustrated
in the Head Start impact study (Cicirelli, 1969). In this study a group labeled
“Head Start Children” was compared with control children on a number 
of achievement, behavioral and attitudinal variables. Most of the post- 
tests showed no significant differences between the two groups when the 
children were in the second grade. But is there a single treatment that 
can be labeled “Head Start”? In their study of 38 teachers in a single 
summer Head Start program, Conners and Eisenberg (1966) found 
significant differences in class mean IQ gains over the summer. In 
addition, there were significant differences in several of the observed 
behaviors of the teachers of the high-, middle-, and low-achieving classes. 
These differences suggest that within a single Head Start program the
students were exposed to different treatments. If these differences are 
ignored in the analysis, and if something labeled “Head Start” is com- 
pared to something labeled “control,” then the comparison is being made 
between the effect of one vague complex of variables and the effect of 
another vague complex (Travers, 1969). As Stake (in press) noted in
his introduction to the forthcoming AERA Monograph No. 6 (“Classroom
Observation”):


Neither an understanding of what the curriculum has been or 
what should be tried next time is possible without data on teach- 
ing methods. In some evaluation studies the most valuable data 
will be those gathered by a classroom observation system. 


Classroom Observational Instruments 


Instruments for the observation of instruction are currently divided
into category systems and rating systems. This division is based on the [...]

These distinctions between category systems and rating systems are
gross because the instruments which have been developed for the observa- 
tion of classroom behavior vary greatly in both the specificity of the be- 
haviors to be observed and in the manner of recording the frequency of 
the behavior. For example, the item “teacher use of ridicule or threatening
behavior” appears to be a low-inference item, even though it was [...]

[...] or superior in clarity. These differences in scale markings make it
difficult to determine if rating
scales are being used to judge the value or estimate the frequency of a 
behavior. Perhaps many investigators dislike the use of rating systems 
because words such as superior and unacceptable appear on some scales.
In such cases, it is rather simple to convert the scale markings so they
are estimates of frequency. Because of this problem, it would be useful to
determine whether the connotations of different scale markings affect
ratings, and whether any differences which might occur are systematic 
or random. 

Category systems have become very popular in descriptive educational
research and in teacher training because they offer greater low-inference
specificity and because an “objective” count of a teacher's
encouraging statements to students appears easier for a teacher to accept 
than a “subjective” rating of his warmth. The major disadvantages of 
category systems are the cost of using observers and the difficulty of speci- 
fying behaviors before they can be included in a category system. 

Rating systems offer greater flexibility than category systems because 
they can include high-inference variables. Rating systems can also be 
less expensive if the students in the classrooms are used as observers. For
example, by using students as observers, the investigators in Harvard
Project Physics (Anderson, Walberg, and Welch, 1963) were able
to obtain information on the classroom climate of more than 150
classrooms without any payment to observers. The disadvantages of using
rating systems are summarized by Mouly (1969); they include the halo 
effect, the error of central tendency, generosity or leniency error, and 
the lack of a common referent for scoring calibrations such as “excellent” 
or “seldom.” Another disadvantage, noted by Gage (1969), is that high- 
inference items are difficult to translate into specific behaviors. This 
suggests that evaluative reports based on high-inference measures may 
offer few specific suggestions for improving an instructional program. 
An evaluative report which suggests that teachers need to improve their 
clarity and organization, without giving the low-inference correlates of 
such behaviors, may amount to little more than suggesting that the teachers 
be “good and virtuous.” 


Category Systems 
Reviews and Anthologies 


Several recent articles included reviews and critiques on the develop- 
ment and use of category systems (Amidon and Simon, 1965; Biddle, 
1967; Biddle and Adams, 1967; Campbell, 1968; Gage, 1969; Lawrence, 
1966). In each of these articles, the authors cite a number of category 
systems and provide a general introduction to research in this area.
Unfortunately, three articles which are more comprehensive than any of the
above are not yet readily available (Bellack, 1968; Furst, in press;
Nuthall, in press). There are at least four books of readings which contain
reprints of original reports on the development and use of category systems 
(Amidon and Hough, 1967; Hyman, 1968; Nelson, 1969; Raths, Pancella, 
and Van Ness, 1967). The volume by Amidon and Hough contains a 
number of studies on the application of Interaction Analysis to the train- 
ing of teachers; the volume edited by Hyman contains 11 articles which 
present the descriptive results obtained when the category systems were 
used to observe teaching. 

The most comprehensive reference on observational category systems 
is the fifteen-volume anthology, Mirrors for Behavior (Simon and Boyer, 
editors). The first six volumes (Simon and Boyer, 1967) contain the 
original coding instructions for 26 systems; the nine forthcoming volumes 
(Simon and Boyer, in press) will cover an additional 54 category systems. 
The eighty systems will be summarized in Volume 15. The nine forth- 
coming volumes will be limited to 225 copies which will be distributed 
to major educational libraries. 

Far more than eighty observational category systems have been devel- 
oped. I can specify forty systems which are not cited by Simon and Boyer; 
no estimate can be made of the additional category systems which could 
be located. For example, 11 category systems cited in this review were not
included in the eighty selected by Simon and Boyer (see Bloom and Wilen- 
sky, 1967; Conners and Eisenberg, 1966; De Landsheere, 1969; Evans, 
1969; Fortune, 1967; Katz, 1969a; Morsh, 1955; Solomon, Bezdek, and 
Rosenberg, 1963; Spaulding, 1965; Vickery, 1968; Zahorik, 1969).


Types and Examples of Observational Category Systems 


There is no simple way to classify the variety of existing category 
systems which have been developed for the observation of classroom 
behavior. Some reviewers have classified them as primarily “affective,”
“cognitive,” or as representing a combination of these dimensions (Amidon 
and Simon, 1965; Bellack, 1968; Furst, in press; Simon and Boyer, 1967). 
Although such terms may be useful for classifying the major variables in 
an observational category system, one could also classify a category system 
by the number of “factors” which it contains. 

Most category systems are one-factor systems in which each behavior 
is coded only in terms of its frequency. The variables in the factor can 
be affective, cognitive, or both. One-factor systems have been developed 
which are primarily affective (e.g., Flanders, 1965), primarily cognitive
(e.g., Davis and Tinsley, 1967), or which focus on teacher feedback
(Zahorik, 1968). There is no limit to the number of variables which can 
be included in a one-factor system. Systems have been developed with
as few as four variables (Bloom and Wilensky, 1967) or as many as
eighty variables (De Landsheere, 1969). The major advantages of one- 
factor systems are the ease of coding and the ease with which they can 
be modified for use by other investigators. 
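
As an illustration only, the tallying that underlies a one-factor system can be sketched in a few lines of Python. The category labels and the observation record below are hypothetical; they are not drawn from any system cited in this review.

```python
from collections import Counter

def tally_one_factor(codes):
    """Return {category: (count, percent)} for a sequence of coded behaviors."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing was observed, so there is nothing to summarize
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}

# Hypothetical record: one code per observed behavior, in order of occurrence.
record = ["question", "praise", "question", "lecture", "question"]
summary = tally_one_factor(record)
```

Each behavior contributes only to a frequency count, which is why such a summary carries no information about the content or activity in which the behavior occurred.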

One-factor systems seem to offer only gross measures which may 
obscure other important classroom variables. These systems do not provide 
for specifying the content, activity, level of conceptualization, or topic 
of teacher or student behavior. For example, two teachers may be teaching 
the same BSCS unit and be coded as having identical percentages of various
types of questions, yet one teacher may be discussing the use of the microscope,
and another may be discussing the decoration of the bulletin board.
Similarly, teacher praise regarding student knowledge may be a different 
variable from praise of student persistence. Student persistence itself can 
differ in context; in one class the students may be attempting persistently 
to sound out new words, and in another classroom students may be mak- 
ing Halloween decorations persistently. 

The need for more comprehensive information on classroom inter- 
action led to the development of new systems, and two approaches were 
used: the addition of “factors” in the analysis of variance factorial design 
sense, and the division of categories by subscripting the variables within 
a category. 

The clearest example of a factorial category system is the Topic 
Classification System developed by Gallagher and his associates (Gallagher, 
1968) in which each “topic” of classroom interaction is classified three 
ways: according to emphasis upon skill or content, the level of conceptuali- 
zation (e.g., data, generalization), and the logical style used by the teacher 
(e.g., description, explanation, expansion). Openshaw and Cyphert (1966) 
developed a four-factor system in which each “encounter” is categorized 
according to the origin of the encounter, the target, the mode (e.g., speaking,
reading, writing), and the purpose of the behavior. This last factor
is subdivided into five categories, each one containing three to eight
subcategories.
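
A factorial system of this kind might be represented, as a rough sketch, by recording each unit as a tuple with one value per factor. The three field names follow the description of Gallagher's system given above; the sample lesson and the helper names `cell_counts` and `marginal` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple

class Topic(NamedTuple):
    emphasis: str  # "skill" or "content"
    level: str     # e.g. "data", "generalization"
    style: str     # e.g. "description", "explanation", "expansion"

def cell_counts(topics):
    """Count topics in each cell of the emphasis x level x style design."""
    return Counter(topics)

def marginal(topics, factor):
    """Collapse the factorial table onto a single named factor."""
    return Counter(getattr(t, factor) for t in topics)

# Hypothetical coded lesson of four topics.
lesson = [
    Topic("content", "data", "description"),
    Topic("content", "generalization", "explanation"),
    Topic("skill", "data", "description"),
    Topic("content", "data", "description"),
]
cells = cell_counts(lesson)
by_level = marginal(lesson, "level")
```

Collapsing onto one factor recovers the kind of summary a one-factor system would give, while the full cell counts retain the combinations.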

Zahorik (1969) reported on the use of a three-factor system in which 
types of teacher feedback (Zahorik, 1968) were also classified according 
to the type of venture (Smith et al., 1964) in which they were used and 
whether the feedback occurred within or at the end of the venture. This 
system allows an investigator to explore the effect of different types of 
feedback during different types of cognitive ventures. Zahorik’s system 
appears to have strong potential value because it represents the factorial 
combination of two highly-developed category systems. 

An ingenious and unique two-factor system was developed by Denny 
(Educational Product Report, 1969). The two factors were: the type of 
question asked by teacher or student, and the activity on which the
question focused (e.g., teaching lecture, class report, test or quiz). The
factor on activities may be very useful because (a) it is easy to add to one- 
factor systems such as those developed by Flanders and his students, (b) 
it may provide important information on the context of an event, and 
(c) it can be modified easily to fit the special activities of different pro- 
grams and subject areas. An activities factor developed for a reading 
program would be quite different from one developed for math, but an 
observational report which included activity variables might provide im- 
portant information on the content and activities of a lesson. 

In the category system used by Spaulding (1965), teacher behaviors 
are classified according to their (a) major type, (b) source of authority, 
(c) number of class members included, (d) amount of attention the class 
gives to the statement, (e) tone of voice, (f) technique used, and (g) 
topic. This category system is not clearly factorial because the options for 
classifying technique and topic differ according to the major behavior. 
For example, there is a topic for disapproval regarding “violation of rules,” 
but violation of rules does not appear under topics of approval. The result 
is an extremely comprehensive category system, but one which is very 
long, requires tape recordings for coding, and has been used only by 
Spaulding. 

Other investigators expanded category systems by “subscripting” larger 
categories. Thus, Amidon, Amidon, and Rosenshine (1969) developed 
the Expanded Interaction Analysis System by adding from two to four 
subscripts to each of the ten categories developed by Flanders (1965). 
This subscripting was accomplished by adding variables from other cate- 
gory systems such as those of Gallagher and Aschner (1963) and Hughes 
(1959). For example, the original category “teacher asks questions” was
subscripted to specify four types of question, and the category “teacher
gives directions” was subscripted to specify cognitive directions and managerial
directions. De Landsheere (1969) developed a nine-category system
containing from four to eight subscripts for each category. For example, 
the category “positive feedback” was subscripted to specify (a) stereo- 
typed approval, (b) repetition of student response, (c) giving reasons for 
approval, and (d) other forms of positive feedback. An interesting com- 
promise was developed by Parakh (1968) in which 17 categories were used, 
but only 5 contained subscripts. 
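
Subscripting can be sketched, under assumptions, as coding each behavior with a (category, subscript) pair. The two category names echo the examples above, but the question-type labels `type_a` and `type_b` are placeholders, since the four question types are not named here.

```python
from collections import Counter

# Each code is a (category, subscript) pair; subscript labels are hypothetical.
codes = [
    ("asks_question", "type_a"),
    ("asks_question", "type_b"),
    ("gives_directions", "cognitive"),
    ("gives_directions", "managerial"),
    ("asks_question", "type_a"),
]

def fine_counts(coded):
    """Frequencies at the subscripted (finer) level."""
    return Counter(coded)

def collapse(coded):
    """Drop subscripts to recover counts for the original broad categories."""
    return Counter(category for category, _ in coded)

fine = fine_counts(codes)
broad = collapse(codes)
```

Because the subscripts nest within the original categories, data coded with an expanded system can still be compared with data coded using the unexpanded one.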


General and Specific Category Systems 

Almost all of the category systems are general systems: they are de- 
signed for use in all instructional situations. Some investigators demonstrated 
the general usefulness of their systems by applying them to more than one 
subject area (see Flanders, 1965; Gallagher, 1968). Those systems developed 
for special subject areas contain little to distinguish them from the general
category systems. For example, of the 23 categories and subcategories in the
Biology Teacher Behavior Inventory (Evans, 1969; Balzar, 1969), only two 
categories are specific to science, and none is specific to biology; of the 26 
categories and subcategories developed by Parakh (1967-68, 1968), there is 
only one category related to science: the subscript “gives or asks for infor- 
mation about the nature of science”; the OSCAR R (Medley and Smith, 
1968) that was developed specially for observing reading instruction has 
only two categories unique to reading. With little modification, any of the 
category systems examined for this review could be applied to another 
subject area. Even the system developed by Wright and Proctor (1961) for 
the observation of mathematics lessons could be applied to any situation in 
which the focus is upon deductive and inductive reasoning. 


Rating Systems 


In preparing this review, I found no anthologies of rating forms for 
observing teaching, no body of descriptive research resulting from the use of 
these instruments, and no reviews of research. This lack of attention to 
rating forms is regrettable because recent research using fairly specific items 
with rating forms has yielded promising results. An estimate of the 
predictability of rating systems may be obtained by studying the results of 
seven investigations in which teacher behavior was described using category 
systems and rating scales. In all the studies some items in each system were 
related significantly to the adjusted criterion score or discriminated sig- 
nificantly among teachers grouped according to student achievement. 
However, in six of the studies (or sets of studies) the bi-variate correlations 
or F-ratios were higher for specific rated behaviors (rating systems) than 
they were for specific counted behaviors (category systems) (Fortune, 1967; 
Gage et al., 1968; Morsh, 1956; Morsh, Burgess, and Smith, 1955; Solomon 
et al., 1963; Wallen, 1966; Wallen and Wodtke, 1963). The raters were 
classroom students or observers. Measures of interrater reliability were 
comparable to those obtained using category systems. The above results are 
too varied to attempt to synthesize the findings, but they suggest that ratings 
are a useful source of information about an instructional program. 

Perhaps one advantage of rating systems is that an observer is able 
to consider clues from a variety of sources before he makes his judgment. 
Even though the low-inference correlates of “clarity” are presently un- 
known, ratings on variables referring to the clarity of the teacher's 
presentation were significantly related to student achievement in all 
studies in which such a variable was used (Belgard, Rosenshine, and Gage, 
1968; Fortune, 1967; Fortune, Gage, and Shutes, 1966; Solomon, Bezdek, 
and Rosenberg, 1963; Wallen, 1966). The results on “clarity” are particularly
robust because the investigators used different observational
instruments. Furthermore, some investigators used student ratings, some
used observer ratings, and the student ratings were given before the criterion 
test in some studies and after the test in others. 


Rating Forms in Curriculum Evaluation 


The studies cited above in which rating scales were used were focused 
on traditional instruction; they were not studies of the use of instructional 
packages such as a science program. Only two sets of studies were found in 
which rating scales were used to describe specific instructional packages: 
the studies of Harvard Project Physics (Welch, 1969) and the studies of 
the School Mathematics Study Group materials (Torrance and Parent, 
1966). 

Rating scales in which the students responded to specific items on a 
four- or five-point scale (ranging from “strongly agree” to “strongly 
disagree”) were used in both sets of studies. The Learning Environment 
Inventory (Walberg and Anderson, 1967) was used in the evaluation of 
Harvard Project Physics. This inventory consists of 14 scales, each contain- 
ing seven items. The scales on Diversity and Difficulty were designed 
specifically to evaluate the physics project, and the remaining scales were 
justified on the basis of research in social psychology and the results of 
previous research using a slightly different instrument (Anderson, 1968, 
pp. 20-44).

The Learning Environment Inventory, or an earlier version of it, was
used in a number of multivariate studies conducted to determine the
relationship between the students’ perceptions of the class environment and
class learning (Anderson and Walberg, 1968; Walberg, 1969a, 1969b), 
individual learning (Walberg and Anderson, 1968b; Anderson, 1969), 
student pretest scores (Walberg and Anderson, 1968a), and teacher person- 
ality measures (Walberg, 1968). The inventory was also used to determine 
if classes in different courses perceived the environment differently (Ander- 
son, Walberg, and Welch, 1969). Significant relationships between the 
rating scale scores and the criterion variable(s) were found in each study. 

An SMSG Student Attitude Inventory was developed as part of the 
study of School Mathematics Study Group instruction (Torrance and
Parent, 1966). The inventory contained 64 items which focused on (a) 
whether the teacher encouraged various types of cognitive activities, (b) the 
amount of help offered by the school, (c) the clarity and organization of the 
textbook, and (d) the ability and activities of the class. Those classes 
in the upper-third (as measured by posttest scores adjusted for pretest 
achievement) had significantly higher scores than classes in the lower-third 
on 58 of the 64 items and on all four scales. 
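
The procedure for adjusting posttest scores is not described here. One common approach, shown below purely as an assumed sketch, is to regress class posttest means on pretest means, treat the residuals as adjusted scores, and then take the top and bottom thirds. The class labels and scores are invented for illustration.

```python
def adjusted_scores(pre, post):
    """Residuals of posttest regressed on pretest (ordinary least squares).

    Assumes the pretest scores are not all identical.
    """
    n = len(pre)
    mean_pre, mean_post = sum(pre) / n, sum(post) / n
    sxx = sum((x - mean_pre) ** 2 for x in pre)
    sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    slope = sxy / sxx
    return [y - (mean_post + slope * (x - mean_pre)) for x, y in zip(pre, post)]

def split_thirds(class_ids, adjusted):
    """Return (lower_third, upper_third) class ids ranked by adjusted score."""
    ranked = [c for _, c in sorted(zip(adjusted, class_ids))]
    k = len(ranked) // 3
    if k == 0:
        return [], []  # too few classes to form thirds
    return ranked[:k], ranked[-k:]

# Hypothetical class means on the pretest and posttest.
ids = ["A", "B", "C", "D", "E", "F"]
pre = [10, 20, 30, 40, 50, 60]
post = [12, 25, 29, 45, 48, 66]
adj = adjusted_scores(pre, post)
lower, upper = split_thirds(ids, adj)
```

Item scores for the two groups of classes could then be compared, which is the kind of contrast reported above.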

Checklists of activities have also been used to describe instruction.
During the first year of study by Torrance and Parent (1966) the teachers
completed a checklist immediately after eight lessons each month. The
teacher checklists contained 31 highly specific teacher activities (e.g., “gave 
alternative solution to problem”) and 28 specific student activities. A single 
check indicated that the activity occurred at least once during the lesson; a 
double check was entered for continuous activity or one which occurred 
three or more times. At the end of the first year the students completed a 
checklist on 23 specific activities (e.g., “asked questions because of learning 
difficulties,” “thought of an unusual but correct solution to a problem”). 
The frequency estimate was on a six-point scale ranging from “never” to 
“every class period.” Although the checklists had items similar to those 
used in the rating scale described above, and although a lower-inference 
scale was used, almost none of the items discriminated between the upper 
third and lower third of the classes. 

In four studies of elementary reading (Harris and Serwer, 1966; Harris 
et al., 1968) the teachers completed logs on five consecutive teaching days 
each month for five months. The teachers estimated the number of minutes 
spent each day on different reading activities (e.g., sight word drill) and on 
different supportive activities (e.g., art work with reading). The results were 
used to compare the amounts of time spent by teachers using different 
reading approaches and to correlate time spent in various activities with 
measures of student growth. As in previous studies, there was much 
variation within each teaching approach in the amount of time teachers 
reported spending on various reading and supportive activities. 

The above results on the use of rating systems do not imply that rating 
systems should replace category systems. The optimum strategy for research 
and description may be to use both systems. Rating systems would be used 
to probe for unknown complexes of variables that appear to be significant 
correlates of outcome variables; category systems may be used to help 
identify the specific components of the significant items. In addition, rating
forms and category systems which are specific to the instructional program
may be developed.


Uses of Observational Systems 


Four major uses of observational systems for the evaluation of instruc- 
tion are discussed in this section: (a) assessing the variability of classroom 
behavior either within or between instructional programs, (b) assessing the 
agreement between classroom behavior and certain instructional criteria, 
(c) describing what occurred in the implementation of the instructional 
materials, and (d) determining relationships between classroom behavior 
and instructional outcomes. The four areas overlap, and many of the 
studies which are cited fit more than one use. The lack of conceptual 
clarity is a reflection of the problems inherent in any new undertaking.


Variability in Special Curriculum Programs


Program evaluators are particularly interested in studies in which the 
behaviors of teachers within a special curriculum were compared, or in 
which comparisons were made between the behaviors of teachers in a special 
curriculum and the behaviors of teachers using the “traditional” program. 
Such studies may be useful for ascertaining whether there is sufficient 
congruence in classroom behavior to hypothesize that an instructional 
program is a homogeneous variable, or that two instructional programs are 
indeed different instructional programs. Only a few studies of either of 
these types were found, and the generalizability of the results is limited 
by the small number of teachers observed and the disparity of category 
systems used to encode the behaviors. Therefore the findings discussed 
below might better be treated as hypotheses for further study. 

Gallagher (1966, 1968) recorded the discussion sections of six BSCS 
teachers while each was presenting identical material. He coded the discussions
using his Topic Classification System and found significant differences
among the teachers in the number of topics on goals (content vs. style) 
and on conceptual level (data, concept and generalization). There were 
no significant differences among classes on the number of topics coded 
according to cognitive style (e.g., description, explanation, evaluation, and
expansion), but there were significant differences among classes in the 
percentage of teacher talk devoted to the cognitive styles of description, 
explanation, and expansion. 

The reports of other investigators also demonstrate the variability of 
teacher behavior within a specific curriculum, although the statistical 
significance of the differences was not reported. Such trends are evident 
from reports of investigators who used category systems (Parakh, 1967, 
1968) and teacher log reports (Harris and Serwer, 1966; Harris et al., 
1968).

Other investigators used different category systems to compare the
behavior in traditional classrooms to the observed behavior in classrooms in 
which special instructional materials were used. Although a large number 
of comparisons was made, very few differences were statistically significant 
(Evans, 1969; Furst and Honigman, 1969; Hunter, 1969; Vickery, 1968; 
Wright, 1967). I conclude that such results occurred because there was 
greater variation in student or teacher behavior within these curricula than 
among the curricula. 
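
The comparison of variation within and among curricula corresponds to the F ratio of a one-way analysis of variance. The sketch below uses invented figures (say, the percentage of teacher questions for three teachers in each of two curricula) and is not a reanalysis of any cited study.

```python
def f_ratio(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: one list of teacher scores per curriculum.
traditional = [10, 30, 50]
special = [20, 40, 60]
f = f_ratio([traditional, special])
```

With these figures the ratio is below 1, the pattern expected when teachers within a curriculum differ from one another more than the curricula differ on average.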

In two of the above studies (Hunter, 1969; Vickery, 1968) student use 
of laboratory material was a major focus of the special instructional program. 
The category systems developed for classroom observation included items on
student use of materials and on teacher verbal behavior. In Hunter’s study,
there was significantly more talk among students while they used the
materials, but no significant differences on verbal behaviors such as types
of teacher questions or types of student response to the teacher. In Vickery’s 
study, teachers using the experimental materials spent significantly more 
time interacting with individuals in laboratory-centered behaviors, but
there were no significant differences among the teacher groups in the amount 
of indirect or direct teacher talk as measured by the Interaction Analysis 
categories. 

The results obtained by Hunter and by Vickery may be related to 
the training program. In both studies the in-service program for introducing 
the teachers to the new curriculum stressed the relatively simple behavior of 
increased student use of science materials and, apparently, there was no 
specific instruction in teacher use of “higher order questions,” or encourage- 
ment of varied types of student statement or question. In the studies by 
Hunter and by Vickery the teachers may have modified their behavior 
exactly as they were trained; no more and no less. Similarly, in the other 
programs discussed above, the teachers were given new materials and new 
books, but apparently they were not given training in specific instructional 
techniques. The teachers were free to vary their instruction, and they 
apparently did so. 

One exception to the pattern of no significant differences in
teacher behavior across instructional materials was reported by
Anderson, Walberg, and Welch (1969). Student ratings obtained on the
Learning Environment Inventory indicated that the overall climate of classes 
using the Harvard Project Physics materials was different from that of the 
classes using other instructional materials. As the investigators predicted, 
students in the experimental classes rated their classes as significantly lower 
in Difficulty and higher on the Diversity scale. These two scales were 
specifically developed to evaluate the experimental program. All teachers 
using the experimental materials received special summer training. 

In almost all of the studies discussed above there were wide variations 
in the classroom behaviors of teachers who were using the same instructional
materials. Gallagher (1966, p. 33) concluded: 


The data would suggest that there really is no such thing as a BSCS 
curriculum presentation in the schools. . . . Each teacher filters the 
materials through his own perceptions and to say that a student has 
been through the BSCS curriculum probably does not give as much 
specific information as the curriculum innovators might have hoped. 


Gallagher’s conclusion might also be drawn from the other studies in this
area. But if such an hypothesis is to be considered, then it is inconsistent 
to claim, as some have, that teachers are teaching the “new curriculum” in 
the same manner as they taught the “old.” Perhaps the variation among
teachers is so great that we cannot speak of an old or a new curriculum as
a single instructional variable. 

Although observational systems have been used to specify variability 
in behavior across classrooms, it is difficult to assess the importance of all 
these variations. Some of the behaviors may be important because their 
frequency or patterning may be significantly related to student outcome 
measures, and some of the behaviors may indicate classroom conditions 
within which other forces acted; but some of the behaviors may have no 
relevant educational meaning. At this time it is difficult to distinguish the 
relevant behaviors from the irrelevant ones. 


Criterion-referenced Instruction 


A major use of an observational category system would be to determine 
whether the teachers are implementing an instructional package in an 
acceptable manner. The criteria for “acceptable” can come from the 
curriculum developers, the evaluators, or the teachers. 

Those who developed a set of instructional materials seldom provided 
explicit guidelines on classroom behaviors pertinent to evaluating instruction. 
Gallagher (1968, p. 43) noted that: 


. . . those interested in curriculum development have not finished
their job when they have packaged a cognitively valid and consistent
set of materials. They must establish, in addition, how these materials
are operationally introduced in the classroom environment. Otherwise,
they will be left with certain unqualified assumptions as to how their
package is unwrapped in the classroom.


In the above studies, three different category systems were used to observe 
instruction using the BSCS materials, but not one system was developed 
either for observing BSCS instruction or for assessing whether the teachers 
were performing in accordance with the BSCS design. If the developers of 
new instructional materials are accountable for the instructional results, 
perhaps they and the evaluation team should also determine whether the 
teachers are implementing the materials in an acceptable manner; perhaps 
they should also develop observational systems which can be used both in 
teacher training and in assessing whether the new curriculum is being given 
a fair trial. 

Only one example was found of the use of “quality assurance” as part 
of curriculum implementation. This was the description of the Oral 
Language Program disseminated by the Southwestern Cooperative Educa- 
tional Laboratory, Albuquerque, New Mexico (Olivero, undated). “Per- 
formance standards” were set for the teachers using the program. Teachers 
who were unwilling to accept the standards had the right to stop using the
program; they could also attend special in-service workshops, but “ultimately,
if the teacher is unable or unwilling to perform, he or she will be asked to
stop using the Oral Language Program” (Olivero, undated, p. 8). 

Two experimental studies were found in which an observational system 
was used to determine fidelity to instructional specifications. In a study of 
two approaches to preschool instruction, Katz (1969b) used the data 
obtained from her category system to determine that the three experimental 
teachers were not performing according to the predetermined criteria, and 
therefore she decided that the experiment was not a valid test of her 
hypothesis. 

Worthen (1968a, 1968b) compared the effects of using a discovery and 
an expository approach to teach mathematical concepts. Model discovery and 
model expository teaching behavior on each of five characteristics was 
specified, and a paradigm of teaching techniques for each method was 
established. The eight participating fifth and sixth grade teachers received 
two to six hours’ weekly training for 13 weeks prior to and for seven weeks 
during the experiment. This training included use of two specific methods 
of instruction and familiarization with the mathematical concepts used in 
the instructional materials. Each teacher used an expository approach with 
one class and a discovery approach with another. An observer rating scale 
and a student questionnaire were used to assess the degree to which the 
teacher adhered to the prescribed teaching model. Mean scores on each 
instrument indicated that the teachers taught the two classes in distinctly 
different ways, and that the teachers varied their behaviors in a manner 
appropriate to each method. 

The work by Worthen might serve as a model for research of this type 
because it included clear specification of behaviors, extensive training of 
teachers in the use of specific teaching behaviors, development of obser- 
vational instruments specific to the instructional procedures, and use of these 
instruments by observers and by students. It is also the only study I found in 
which students were used as observers of teacher adherence to a specific 
instructional model. 

Such an approach to performance standards might be extended to the 
evaluation of other special curriculum projects. Classroom observational
techniques can be used to determine whether the classroom instruction is 
continuing according to the plans of the curriculum developers. If it can be 
determined that some teachers are deviating from the intentions of the 
curriculum developers, the final report could have two sections, one for all 
the participating teachers and one for teachers whose behavior met specified 
criteria. The second section would represent the effects of the instructional 
materials under intended conditions; the first section would represent the 
effects under more general conditions. The number of teacher dropouts 
because of inability to meet performance specifications would be an
additional measure of the amount of extra work necessary to implement the
curriculum in accordance with the intentions of the designers. 
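
The two-section report proposed here might be assembled along the following lines. This is a sketch under assumptions: the fidelity score, the threshold of 0.75, and the records are all hypothetical, and a single numeric score stands in for whatever performance criteria the developers specify.

```python
from statistics import mean

def two_section_report(records, threshold):
    """Summarize outcomes for all teachers and for those meeting the criteria.

    records: list of (teacher_id, fidelity_score, class_outcome) tuples.
    """
    everyone = [outcome for _, _, outcome in records]
    adherent = [outcome for _, fid, outcome in records if fid >= threshold]
    return {
        "all_teachers": mean(everyone),
        "met_criteria": mean(adherent) if adherent else None,
        "n_not_meeting": len(everyone) - len(adherent),
    }

# Hypothetical observation and outcome data for three teachers.
records = [("T1", 0.9, 72.0), ("T2", 0.4, 60.0), ("T3", 0.8, 78.0)]
report = two_section_report(records, threshold=0.75)
```

The count of teachers not meeting the criteria corresponds to the additional measure of implementation effort mentioned above.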


Relating Observed Behaviors to Outcome Measures 


Compared to the large number of descriptive studies, there have been 
relatively few studies of the relationship between measures obtained by the 
use of observational systems and measures of class achievement adjusted for 
initial aptitude or ability. Approximately forty studies were completed in 
this area. These studies are difficult to synthesize for at least three reasons: 
they varied widely in subject area, grade, and observational instruments; 
some studies used statistical procedures which were not appropriate; and in 
many studies the number of classrooms observed was less than twenty. 
Two attempts have been made to review some of the research relating 
classroom behavior to student achievement (Flanders, 1970; Rosenshine, in 
press a), but because of the above difficulties it is too early to identify 
relationships that can be stated with any confidence. 

Most of these studies were investigations of teachers engaging in 
traditional instruction or using materials specially designed for use in the 
study. There were few studies relating teaching behaviors to student achieve- 
ment within the context of special instructional packages such as the BSCS. 
An evaluator who has collected information on classroom behavior and 
student outcomes is in an excellent position to contribute to this research
by attempting to relate the two.


Difficulties in Selecting and Using Observational Instruments 


Selecting or Developing an Appropriate Observational System 


In most instances, a teacher, administrator, or evaluator has little to 
guide him in selecting an observational instrument. Those who develop sets 
of instructional materials seldom provide guidelines specifying the classroom 
behaviors that are important in evaluating instruction. As discussed above, 
the category systems labeled as specific to a subject area include little to 
justify the label. Given these problems, evaluators frequently select a 
general category system which has been used in some other program in the 
hope that it will be useful for their specific purposes. 

An alternative approach is first to identify the objectives of a program, 
then to study the instructional materials, and finally to identify a few 
behaviors or combinations of behaviors which seem most critical to the 
implementation of the materials and the achievement of the objectives. Once 
these performance criteria are identified, their frequency of occurrence could 
be assessed by using existing category systems or rating forms, by selecting 
relevant sections from existing instruments, or by modifying existing instruments
to meet the specific purposes of the evaluation. Successive observations
would then be made to determine how well these criteria are being met and 
to chart the results of a teacher's attempts to modify his behavior. Such 
procedures are detailed in a forthcoming book by Flanders (1970, Chapter 
11). 


Simple and Complex Observational Systems 


Problems of cost may dictate that a relatively simple observational 
instrument be used. However, simple observational instruments may obscure 
important behaviors. For example, “student initiated talk” is easy to classify 
and record, but such talk may take place in a variety of contexts. Student 
initiated talk may be a contribution to a discussion, but it might also repre- 
sent either persistent attempts to get the teacher to clarify what he said, or 
student outbursts in a disorganized classroom. If the category system is 
enlarged so each type of student initiated talk is considered separately, the 
system may become unmanageable. Enlarging the category system may be 
costly in training time and data processing, but such an enlargement may 
yield important data on variability among teachers, fidelity of instruction 
to an instructional model, and the relationship of teaching behaviors to 
student outcome measures. 
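
The trade-off between a simple and an enlarged category system can be illustrated with a small tally. The sketch below is only an illustration: the category labels and the list of coded events are hypothetical, and the three subtypes of student initiated talk are the ones named in the paragraph above.

```python
from collections import Counter

# Hypothetical codes an observer might record, one per tallied event.
# The subtype after ':' exists only in the enlarged category system.
events = [
    "teacher_talk",
    "student_initiated:contribution",
    "student_initiated:clarification_request",
    "teacher_talk",
    "student_initiated:outburst",
    "student_initiated:contribution",
    "student_response",
    "student_initiated:clarification_request",
]

def tally(codes, collapse=False):
    """Count events per category; optionally drop the subtype after ':'."""
    if collapse:
        codes = [code.split(":")[0] for code in codes]
    return Counter(codes)

simple = tally(events, collapse=True)   # one "student_initiated" category
detailed = tally(events)                # one category per subtype

print(simple)
print(detailed)
```

The collapsed tally reports a single frequency for student initiated talk, whereas the enlarged tally separates contributions, requests for clarification, and outbursts at the price of more categories for observers to learn and more cells to process.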

One of the most difficult problems is choosing between a simple and 
a complex unit of measure (Biddle, 1967). In most category systems, fre- 
quency counts serve as the unit of measure, but some investigators have 
developed more complex units such as a topic (Gallagher, 1968), a venture 
(Smith et al., 1964), or a teaching module (McNaughton et al., 1967). 
These more complex units may be intrinsically more satisfying because 
they code patterns and sequences of classroom interaction; but typescripts 
are required to code these units, and preparing typescripts and training 
observers is costly. 


Description and Judgment 


When an investigator presents observational data on instruction, the 
descriptive statements are frequently accompanied by judgments of the 
behaviors that a teacher “should” exhibit. Sometimes these judgments are 
softened by a statement such as “it would seem reasonable to expect that 
. . . ,” but the judgments remain and are frequently followed by prescriptions 
for teacher training. 

For example, Hunter (1969, p. 42) noted that the teachers in her 
sample “do not give reasons for praising right answers or rejecting wrong 
ones.” Morrissett, Stevens, and Woodley (1969, p. 269) reported a descrip- 
tive study in which “the tabulation showed an almost complete absence of 
teachers’ accepting a student’s feelings.” Gallagher (1968, p. 44) stated that 
he “was disappointed at the relatively rare use, sometimes complete absence, 
of topics in the dimensions of Expansion and Evaluation.” In all three 
studies the teachers were using instructional materials prepared by national 
curriculum groups. 

It is difficult to justify the above use of implied “should’s” on two 
counts. First, these evaluators were using general observational systems 
whose relevance to the instructional programs they were evaluating had 
not been established. Second, there were too few studies to date which 
attempted to establish the correlation between classroom behaviors and 
student criterion measures, and too little synthesis of this research; so the 
implied “should’s” in the paragraph above do not rest on established 
empirical evidence. Even if consistent and significant linkages between 
classroom behavior and student outcome measures were established, one 
would need clear statements on the optimum frequencies for behaviors such 
as giving reasons for rejecting wrong answers, accepting students’ feelings, 
or introducing Expansion or Evaluation topics. 


Additional Sources for Observational Variables 


Developers of observational systems appear to have made little use of 
the variables currently being investigated in laboratory research on instruc- 
tion. Variables such as “organizers,” relevant practice, promoting learner 
interest, prompts and fading techniques, organization and sequence, and 
pacing have been studied using meaningful verbal materials, but in situa- 
tions in which the instruction was mediated by written materials, films, and 
audiotape recordings (see Anderson et al., 1969; Popham, 1969; Tanck, 
1969). The results of these studies do not appear in the observational instru- 
ments currently available, but there is nothing to prevent evaluators from 
including these variables in new observational systems. 

One example of the inclusion of variables developed in laboratory 
studies in an observational system is the study by Worthen (1968a, 
1968b). In this study, a special rating system was used to determine whether 
the teachers were following the prescribed discovery and expository 
approaches. Rating systems appear quite appropriate for such initial 
ventures, and evaluators of programs which have a specific instructional 
focus might develop rating systems which include instructional variables 
developed in laboratory studies. Depending upon the purpose of the 
evaluation, the markings on the scale can require estimates of frequency as 
well as estimates of value. 


Data Reduction and Consolidation 


Because of the large number of variables in existing observational 
systems and the difficulty of comparing variables with similar sounding 
names in different observational systems, some reviewers suggested that 
the number be reduced and the variables compared by using data reduction 
techniques such as factor analysis or facet analysis (Biddle, 1967; Gage, 
1969). Factor analytic techniques may be useful to the evaluator of a single 
program for which no replication is intended, but it would be hazardous to 
generalize from factor analytic results for three reasons: in most of these 
studies the number of classrooms is smaller than the number of variables; 
investigators using the same observational system, but in different locations, 
might not agree with others; and one does not know if the factor structure 
would be generalizable across different subject areas, different grades, or 
different curricula in the same subject area. 
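
The first of these reasons can be made concrete. When there are fewer classrooms than variables, the matrix of intercorrelations cannot have full rank, so any factor solution rests on fewer independent dimensions than there are variables. The following is a minimal sketch using simulated scores; the sample sizes and helper functions are hypothetical and are not taken from any of the studies cited.

```python
import random

def column_center(rows):
    """Subtract each variable's mean across classrooms."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def matrix_rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

rng = random.Random(0)
n_classrooms, n_variables = 6, 10  # hypothetical: fewer classrooms than variables
scores = [[rng.gauss(0, 1) for _ in range(n_variables)]
          for _ in range(n_classrooms)]

# The covariance (and correlation) matrix has the same rank as the
# centered score matrix, which cannot exceed n_classrooms - 1.
rank = matrix_rank(column_center(scores))
print(rank, "independent dimensions among", n_variables, "variables")
```

Because the centered score matrix has at most one fewer independent rows than there are classrooms, the correlation matrix of the ten simulated variables is necessarily singular, which is one reason factor structures obtained under these conditions are unlikely to replicate.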

Several attempts have been made to reduce the data from two or more 
category systems using principal components factor analysis with varimax 
rotation (Soar, 1966; Medley, 1969; Wood et al., 1969), but the results are 
not easy to compare, and in all three studies teachers were selected from 
a wide range of grade levels. The results are puzzling in the most compre- 
hensive study (Wood et al., 1969). For example, student statements coded 
as exhibiting application, analysis, synthesis, and evaluation all loaded on a 
factor devoid of any teacher behaviors; teacher behaviors such as amplifica- 
tion or extended amplification of student ideas loaded on a factor which 
did not contain teacher or student cognitive behaviors; and the factor labeled 
“teacher-student cognitive behavior” did not contain any student or 
teacher affective behaviors. 

Solomon, Bezdek, and Rosenberg (1963) used factor analysis to 
consolidate information obtained from student ratings, observer ratings, 
and teacher questionnaires, with information obtained by coding each 
independent clause of classroom verbal behavior into a category system. 
Although the number of variables far exceeded the number of classrooms, 
such an approach might be useful for comparing observer and student 
perceptions and for discovering relationships between high-inference and 
low-inference variables. 


Summary 


Without adequate data on classroom transactions, it is difficult for an 
evaluator to make suggestions for the modification of an instructional pro- 
gram. Yet, researchers are only beginning to develop tools and concepts for 
the evaluation and study of instruction. Currently, three major needs are: 
greater specification of the teaching strategies to be used with instructional 
materials, improved observational instruments that attend to the context 
of the interactions and describe classroom interactions in more appropriate 
units than frequency counts, and more research into the relationship between 
classroom events and student outcome measures. 


6: MEASUREMENT TECHNIQUES 
IN EVALUATION 


DOUGLAS D. SJOGREN 
Colorado State University* 


The demand for sophisticated and compre- 
hensive evaluation of educational efforts is affecting thinking about 
measurement techniques and strategies. Evaluation for many years has been 
equated with a process of determining whether specified objectives are 
attained, but current evaluation models focus on a larger number of 
phenomena. The objectives of an educational effort are still an important 
component of evaluation, but the current models are more inclusive. 
Evaluation theorists indicate that evaluation should attend to outcomes 
other than specified objectives, to inputs or antecedent conditions, and to 
processes or transactions. 

The implementation of an input-process-outcome evaluation plan raises 
important measurement problems. The inclusion of the many variables in 
a comprehensive evaluation requires a massive amount of measurement and 
classification. There are also problems associated with obtaining valid and 
reliable measurement and classifications of a great many variables, including 
many not traditionally considered in evaluation methodology. 


Measurement of Inputs or Antecedents** 


In considering the inputs or the antecedent conditions of an educational 
program, one’s first thought is typically about the characteristics of the 
students and staff. Important data include pupil characteristics (e.g., mental 
ability, past achievement, sex, ethnic background) and information on the 
professional staff. These data are generally obtainable by well-accepted 
procedures. Some of these data, especially those on socio-economic status, do 
present problems. School records are often inadequate for a reliable assess- 
ment of this variable. 


*Dr. Sjogren is now at the University of Illinois. Lee J. Cronbach, Stanford University; 
Garlie Forehand, Carnegie-Mellon University; Leonard Cahen, Educational Testing 
Service; and Robert Heath, Stanford University, served as consultants to Dr. Sjogren on 
the preparation of this chapter. 

**Classification of variables as being input or antecedent, transaction or process, or 
outcome variables was arbitrarily decided by the reviewer. Many of the variables could 
reasonably be considered as appropriate to each or all of the three categories. Which 
category is used is of little importance. The important thing is to recognize the variables 
as important components of evaluation. 


Many evaluation designs include before and after measurement of 
certain variables; the performance on the before measure is input. Some 
developments in measurement relevant to obtaining pre-program measures 
are reviewed in this chapter in the section Measurement of Outcomes. There 
are other variables, however, that are conceptualized under the category of 
inputs and on which data might well be gathered in evaluation studies. For 
example, it might be quite important that information be obtained on certain 
community characteristics or on characteristics of the students’ parents in 
the school in which the program is being conducted and evaluated. Often 
the gathering of these data does not require any new instrumentation be- 
cause they are available in existing files such as school records, city or 
county offices, Chamber of Commerce files, etc. But, the evaluator may 
need to gather primary data for other information. For example, it may be 
desirable to know the attitudes and opinions of the parents toward certain 
aspects of the school at the start of the program. If a sizable group of parents 
have attitudes or opinions in conflict with the intent of the program, its 
effectiveness could be influenced. Furthermore, the objective data available 
in existing files do not often reveal the atmosphere of the community. 
Much important information, such as whether riots have occurred, what pat- 
terns of de facto segregation exist, what tension-reducing activities have been 
carried on, or what the youth sub-culture is like, will require a primary data 
gathering effort. Stufflebeam (1968) made excellent comments on the 
importance of such information for evaluation. 


Primary data on community or parental characteristics are usually 
obtained by an interview or mailed questionnaire built specifically for the 
project. The instrument should be carefully constructed so it will effectively 
obtain the needed data. There are several sources that provide useful 
guidelines for constructing instruments and conducting surveys: Maccoby 
and Maccoby (1954), Cannell and Kahn (1968), Hyman (1955), Lazars- 
feld and Rosenberg (1955), Selltiz et al. (1959), and Oppenheim (1966). 

Econometric techniques are being applied in educational evaluation 
through cost-benefit type analyses (cf. Warmbrod, 1968) or by adaptation 
as reflected in the systems model of evaluation. The application of such 
techniques is creating a recognition of the usefulness of two other types of 
data on inputs. The most obvious of these is cost data, identifying the costs 
of the various components of the program. Measurement of costs appears to 
be a straightforward procedure, but it is not that simple. Works by Kaufman 
et al. (1969), Mulert and Wykstra (1968), Thomas (1969), Kraft 
(1969), Blaug (1968), and Alkin (1969) contain excellent discussions of 
the problems associated with the measurement of costs. One problem occurs 
because the accounting procedures used in schools make it difficult to 
allocate costs to individual programs. The Planning Programming Budget- 
ing System (see for example, Grosse, 1967) is a system that might alleviate 
this problem; however, Hitch (1967) pointed out the limitations of this 
system for educational planning. Even though he recognizes the limitations 
of cost data, the evaluator should attempt to measure costs of a program and 
regard such data as an important input measure. 

Kaufman et al. (1969) wrote that “all costs are fundamentally oppor- 
tunity costs,” that is, “the cost of foregoing the next best alternative.” The 
opportunity cost of a program depends upon the discount rate applied to 
the economic investment; the discount rate will vary among individuals 
and social systems. Mulert and Wykstra (1968) argued that personal 
discount rates are a function of personal values. The economists, in their 
jargon, seem to be talking about a concept similar to what Scriven (1967) 
and Glass (1969) stressed in their discussions about the need to judge the 
value of an educational endeavor in terms of “Is this a worthwhile thing to 
do?” or “Is there a better way to do it?” Taylor (1966) and Maguire (1967) 
sought to measure the value ascribed to educational objectives and pro- 
cedures. They followed their thesis work with efforts in which they found 
that certain groups (e.g. teachers, curriculum writers, and experts) generally 
differentiate in their valuing of certain objectives (Taylor and Maguire, 
1967; Maguire, 1968, 1969). Maguire also found that values placed on 
objectives are predictive of reported priority of the objectives. 
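
The dependence of a program's economic worth on the discount rate can be shown with a simple present-value calculation. This is a sketch under stated assumptions: the benefit stream and the two discount rates are hypothetical figures chosen for illustration, and the standard present-value formula is used here rather than any procedure given by the authors cited.

```python
def present_value(benefits, rate):
    """Discount a stream of yearly benefits (years 1, 2, ...) to the present."""
    return sum(b / (1 + rate) ** year
               for year, b in enumerate(benefits, start=1))

# Hypothetical program: 1,000 units of benefit per year for five years.
benefits = [1000.0] * 5

pv_low = present_value(benefits, 0.03)   # judge who discounts the future lightly
pv_high = present_value(benefits, 0.10)  # judge who discounts the future heavily

print(round(pv_low, 2), round(pv_high, 2))
```

The same stream of benefits is worth less to the judge who applies the higher rate, which is the sense in which personal or social values enter what appears to be a purely economic calculation.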

The judgment of the worth of an educational endeavor ultimately 
depends on the values of the judge. Consequently, by measuring the values 
of the various judges, the evaluator would have very useful data on a 
critical antecedent condition; he would have evidence on what outcomes are 
[...] 

[...] program provide important evaluative information. [...] 
are readily available at little cost, and the eval[...] 
obtaining such data in the evaluation plan. Other measures are not yet 
well-developed, but, hopefully, measurement procedures will be developed 
so that this potentially rich information can be included in evaluations. 


Evaluators of educational programs can benefit from a study of operations 
research techniques as they might be applied to education. Blaug (1966) 
provided an excellent bibliography of operations research applications. 


Measurement of Processes or Transactions 


Evaluation based on measurement of attainment of objectives ignores 
another category of important data. To judge adequately the worth of a 
program or its effectiveness, one must have a description of what occurred 
during the program. 

Technology for measuring process or transaction variables is not well 
developed, probably because such variables were not recognized as important 
in evaluation until recently. The recognition of the need for such measures 
has led to some promising work and thought. Furthermore, there are some 
techniques that were developed for other purposes that can be readily 
adapted to measurement of process or transaction variables for evaluation. 

Gordon (1967) and Stake and Denny (1968) discussed the need for 
methods to measure materials. Gordon (1968, p. 26) suggested that “Each 
piece or set of materials can be studied to explicate: (a) content goals, (b) 
long-range goals, (c) teacher requirements, (d) pupil expectations, and 
(e) community factors.” He stressed the importance of assessing the internal 
consistency of materials. Tyler (1968) presented guidelines for determining 
the essential, necessary, and needed specifications of instructional materials 
following the idea of the technical manual for tests (American Psycho- 
logical Association, 1966). 

The Educational Products Information Exchange Institute was started 
in 1967 (see EPIE Forum, Vol. I, No. 1) with the expressed purpose of 
providing assessment and evaluation of instructional materials. Although the 
Institute has not been able to fulfill its purpose as well as would be desired, 
their publication (formerly EPIE Forum, now Educational Product Report) 
has contained a section in which descriptive information is provided about a 
certain class of educational materials. For example, in Volume I, No. 2, of the 
EPIE Forum there is a section on elementary school science series. The 
descriptive material is useful, but hopefully the Institute will also be able to 
include substantiated judgments of the materials. 

Some work has been done on developing instruments for assessing 
instructional materials. Eash (1969) reported on an instrument that in its 
try-out form yielded reliable ratings of certain characteristics of materials. 
Anderson (1969) described an instrument used in assessing shorthand text- 
books which seems to be adaptable to other areas. Easley, Kendzior, and 
Wallace (1967) adapted an instrument for assessing biology tests (Easley, 
Jenkins, and Ashenfelter, 1967) to assessment of biology texts. Four 
general areas were rated: student task descriptions, methods of presentation, 
knowledge mode of assignable units, and image of science. They reported 
that raters using the instrument required rather intensive training before the 
profiles were reliable. Dick (1968) developed a methodology for the 
evaluation of programed materials which might be adaptable to evaluation 
of other kinds of instructional material. Efforts at instrumentation for the 
assessment and evaluation of instructional materials are just starting, but the 
early results are encouraging and several instruments and techniques will 
probably be available for this purpose soon. 

Another type of measure that would be useful for evaluation is one 
that would measure or describe the milieu within which the educational 
program occurs. Menne (1967) reviewed techniques that were developed for 
assessing college environments; he classified the techniques into three 
categories. In one category he placed the techniques that measure the 
environment in terms of objective institutional characteristics. This approach 
is exemplified by the Environmental Assessment Technique developed by 
Astin (1962, 1963, 1965a; Astin and Holland, 1961; and Creager and 
Astin, 1968). Richards, Rand, and Rand (1966) adapted this technique to 
the measurement of junior college environments. Several studies in which 
the technique was used indicated that it effectively differentiated among 
colleges and that the measures correlated with student behavior and out- 
comes. 

Assessing the perceptions that students have of their environment is a second 
approach to measuring college environment. This approach is exemplified 
by the College Characteristics Index developed by Pace and Stern (1958) 
and a later instrument by Pace (1963) called the College and University 
Environment Scales (CUES). Berdie (1967) found that the results from the 
use of the latter instrument were sensitive to the scoring method used. Robin- 
son and Seligman (1969) reported on a study in which a campus morale 
scale was isolated from the CUES instrument. The third approach, accord- 
ing to Menne, is exemplified by the Inventory of College Activities as 
developed by Astin (1965b). This approach is directed at describing the 
environment in terms of observable student behaviors. 

Menne evaluated the three approaches and concluded that “when the 
study envisions an environmental manipulation the perception approach 
is preferred.” Most educational programs to be evaluated would involve 
manipulation of the environment. The reviewer concurs with Menne 
that at present the perception approach would be most useful in such 
situations because the concern would generally be with how the environ- 
ment is perceived by the participant rather than with what the environ- 
ment is in some objective sense. The third approach, observable student 
behavior, would also often be desirable. Unfortunately, this approach is 
not as well developed as the other two. Work on such an instrument for 
a public school setting would be a useful project. 

Wolf (1965) provided a useful discussion of the need for environ- 
mental measures in a public school context. Few such measures are avail- 
able, however. Stern, Stein, and Bloom (1956) described the Stern High 
School Characteristics Index. Herr (1965) reported a study in which differ- 
ential perceptions of environmental press were indicated with the Stern 
instrument. Sinclair (1969) reported on an adaptation of the CUES instru- 
ment to the measurement of elementary school environments. The adapta- 
tion did seem to be effective in detecting differences and patterns in 
elementary school environments. An adaptation for the secondary school 
is being developed. Another system of measuring the public school environ- 
ment was reported by Barker and Gump (1964) and Gump (1968) in 
which they described work on defining the ecology of the classroom and 
the school. This work of Gump was focused more on classroom environment 
than on school environment, but the system he used seems adaptable to 
describing rather large units. Doremus (1966) developed an instrument 
for measuring the educational process in schools. Four general indicators 
of quality were found to differentiate among schools: creativity, individual- 
ization, group activity, and humanization as provided for in the classroom. 


The “stuff” of processes or transactions is what happens in the class- 
room. Description and measurement of the activities in the classroom are 
essential components of an educational evaluation. Several instruments 
and methods for observing and recording classroom situations are available. 
There are two excellent publications that review the instruments and 
methods for classroom observation and obviate the need for review of the 
various methods in this chapter. Medley and Mitzel (1963) provided an 
excellent review and critique of the various methods. This source is also 
useful for the presentation of methods of analyzing the data obtained with 
the instruments. Simon and Boyer (1967) prepared an anthology of class- 
room observation instruments. They classified the systems into three 
families: affective, cognitive, and composite. This publication is a valuable 
source for assisting the evaluator in selecting a system appropriate to his 
evaluation effort. Other work on classroom observation not reviewed in 
either of the above publications includes Walberg (1966); Gump (1967); 
Steele (1969); Meux (1967); and Gallagher, Nuthall and Rosenshine (in 
press). Weick (1968) discussed the general nature of the observation of 
behavior in groups. 

Two problems persist in the use of classroom observation. First, al- 
though there are many observation systems, none has been especially 
useful for a purpose that Glaser (1968) believed important. Glaser argued 
that there is a need for measuring treatments along some meaningful 
dimension. For descriptive or assessment purposes such scales may not be 
essential, but they are essential if evaluators are to strive for generalizability. 
Generalizability of results of educational programs is likely to be limited 
until this kind of measurement of treatment is achieved. In measurement 
of treatment there is a concern that the intended treatment can be mean- 
ingfully classified or placed on some continuum. Whether the intended 
treatment was carried out is another concern and measurement problem. 
Steele (1969b) developed a procedure for measuring the congruence 
between a teacher’s intents and practices and found the procedure useful 
for assessing the existence and degree of stability of an instructional treat- 
ment. Steele’s effort is a promising beginning. 

The second persistent problem is that of eliminating the effect of 
the observation process. Weick (1968) and O'Keefe (1968) suggested 
procedures for reducing the reactive effect of the observer, but few of 
these procedures have been tested empirically. Webb et al. (1966) discussed 
and provided examples of the many unobtrusive measures that might be 
used for measuring processes as well as outcomes. Observer effect and 
reactive effects of instruments are persistent problems for the evaluator 
and demand immediate attention. 


Additional References: Barton (1961); Forehand and Gilmer (undated). 


Measurement of Outcomes 


Evaluation efforts have typically concentrated on measurement of 
outcome variables, especially those that were specifically stated as objec- 
tives of the program. Measurement of objectives is still considered an 
important evaluative activity. An educational program has many positive 
and negative outcomes that are not stated objectives, but which should 
be considered in evaluating the program. Metfessel and Michael (1967) 
discussed many of the criteria or outcomes, intended and unintended, that 
might be included in evaluation of an educational program. The appendix 
to the Metfessel and Michael article is especially useful since it consists 
of a lengthy list of possible outcomes of a program and suggestions for 
measuring them. 

Educational economists have called attention to a number of possible 
outcomes that have not often been considered. The works of Warmbrod 
(1968), Kotz (1967a, 1967b), and Thomas (1969) are excellent sources 
for learning about the relationship of certain economic concepts to educa- 
tional evaluation. Some questions that the economists raise include: (a) 
What benefits accrue to both the individual and society from the program?; 
(b) What are the investment returns and the consumption returns of 
the program?; (c) Are there “shadow” benefits of the program, i.e., 
what are the non-economic benefits such as being a better consumer of 
the arts as a result of the program?; and (d) What are the trade-offs, 
i.e., what did this person not learn by being in this program instead of 
another? Obviously these are important outcomes of most educational 
programs and should receive attention in evaluation of the program. Meas- 
ures relevant to questions such as these are scarce and their development 
should have a high priority. 

The stated objectives of educational programs are generally concerned 
with a change in behavior such as a changed attitude, perception, or skill 
level, or an increase in knowledge. The measurement of change is usually 
obtained by observing the difference between scores on a pretest and a 
posttest. It is well known that such scores have serious limitations for 
analysis purposes. The most serious limitation is the unreliability of change 
scores. A technical source for learning of the problems associated with 
change scores is Harris (1963). Other authors who discuss the technical 
problems of change scores are Manning and DuBois (1962); Tucker, 
Damarin, and Messick (1966); Traub (1967, 1968); Glass (1968); Camp- 
bell (1969); Saupe (1966); Stanley (1966); Engelhart (1967); and DuBois 
and Mayo (in press). The Engelhart article should be read by persons 
with little sophistication in analysis techniques who often feel the analysis 
procedures resolve the measurement problems. The author pointed out 
the equivalence of certain statistical procedures and that the problems 
inherent in change scores are not necessarily resolved by the analysis 
design. 
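
The unreliability of change scores follows from the classical test theory expression for the reliability of a difference between two measures. The sketch below applies that standard expression; it is offered only as an illustration and is not drawn from any one of the sources cited above, and the reliabilities and correlation in the example are hypothetical values.

```python
def difference_score_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical reliability of the difference (posttest minus pretest).

    r_xx, r_yy: reliabilities of pretest and posttest
    r_xy:       correlation between pretest and posttest
    sd_x, sd_y: standard deviations of pretest and posttest
    """
    covariance = r_xy * sd_x * sd_y
    true_variance = sd_x ** 2 * r_xx + sd_y ** 2 * r_yy - 2 * covariance
    observed_variance = sd_x ** 2 + sd_y ** 2 - 2 * covariance
    return true_variance / observed_variance

# Hypothetical case: two tests each with reliability .80 that correlate .70.
reliability = difference_score_reliability(0.80, 0.80, 0.70)
print(round(reliability, 2))
```

In this hypothetical case two quite respectable tests yield a difference score whose reliability is only about one third, and the figure falls further as the pretest-posttest correlation approaches the reliabilities of the tests themselves.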

Reading the literature on change scores may be discouraging. Evaluators 
want to measure change, but they must be aware of the problems in doing 
so. Actually many of the writers in this area have suggested procedures 
that might be used in a pre-post or time series design. Cronbach and 
Furby (in press) pointed out that change scores are not necessary for 
most purposes and that the concern about measuring change per se is 
unnecessary. Furthermore, they contended that many of the suggested 
procedures are unsound. The authors presented procedures for handling 
data in a pre-post design. If the evaluation problem is one of determining 
whether a single group of students has changed from one time to the 
next, then a test of the significance of the observed differences is appro- 
priate along with reporting the observed means. Cronbach and Furby 
further suggested that the analysis of covariance is appropriate in situations 
in which two or more treatments are studied and the groups are randomly 
formed. In such a situation the pretest would be used as a covariate and 
a comparison would be made of the posttest means adjusted on the basis 
of the pretest. The adjusted posttest means would be reported. Stanley 
(1966) and Evans and Anastasio (1968) pointed out that the treatment 
should be independent of the covariate; Elashoff (1969) provided an 
excellent discussion of the sensitivity of the analysis of covariance technique 
to violation of its assumptions. Unfortunately the non-random group 
situation, which is most typical, presents the difficult analysis problem. 
Cronbach and Furby suggested analysis of covariance or blocking in this 
situation. If analysis of covariance is used, they argued that the covariate 
for a subject should be an estimate of the true score on that variable. 
Furthermore, they indicated that several covariates be used with true scores 
and that each estimate be obtained by regressing on each other as well 
as on the criterion score or scores. Even with such an analysis the authors 
pointed out that results should be regarded with suspicion. In none of 
these procedures would a change score be used, but an indication of 
whether change or differential change occurred would be permitted. 
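
The adjustment of posttest means on the basis of a pretest covariate can be sketched as follows for the randomly formed groups case. This is a minimal illustration of the usual single-covariate adjustment using the pooled within-group regression slope; it is not a reproduction of the Cronbach and Furby procedures, it uses observed pretest scores rather than estimated true scores, and the two groups and their scores are hypothetical.

```python
from statistics import mean

def adjusted_posttest_means(groups):
    """Return (adjusted means by group, pooled within-group slope).

    groups: dict mapping group name -> list of (pretest, posttest) pairs.
    """
    grand_pre = mean(pre for pairs in groups.values() for pre, _ in pairs)
    sum_xy = 0.0
    sum_xx = 0.0
    group_means = {}
    for name, pairs in groups.items():
        pre_mean = mean(pre for pre, _ in pairs)
        post_mean = mean(post for _, post in pairs)
        group_means[name] = (pre_mean, post_mean)
        sum_xy += sum((pre - pre_mean) * (post - post_mean)
                      for pre, post in pairs)
        sum_xx += sum((pre - pre_mean) ** 2 for pre, _ in pairs)
    slope = sum_xy / sum_xx  # pooled within-group regression of post on pre
    adjusted = {name: post_mean - slope * (pre_mean - grand_pre)
                for name, (pre_mean, post_mean) in group_means.items()}
    return adjusted, slope

# Hypothetical (pretest, posttest) scores for two treatments.
groups = {
    "A": [(10.0, 20.0), (12.0, 24.0), (14.0, 28.0)],
    "B": [(14.0, 30.0), (16.0, 34.0), (18.0, 38.0)],
}
adjusted, slope = adjusted_posttest_means(groups)
print(adjusted, slope)
```

In these hypothetical data the unadjusted posttest means differ by 10 points, but most of that difference disappears once the groups are equated on the pretest. No change score is computed at any step.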

The work on change scores has been generally on situations in which 
reliable individual measurement was the concern. Cronbach (1963) pointed 
out that evaluation efforts should be more concerned with measuring the 
educational program than the participant, and that such efforts should 
involve different kinds and ways of obtaining reliable measures. This does 
not mean that reliability of individual measurement is of no concern to 
the evaluator. It is, and evaluation effort will surely benefit from improvements
made in measuring individual differences. Nevertheless, the evaluator
needs to recognize what is being evaluated (usually programs, not people) 
and that there are some procedures that can be used to obtain reliable 
measures of program effects, including change, even though the measures 
may not be especially reliable for measuring persons. 

Cronbach (1963) indicated that in measurement for evaluation it 
was unnecessary and perhaps even undesirable that every student be ad- 
ministered all measures. He suggested that more information would be 
obtained about the educational program if a large item pool were con- 
structed and samples of items were given to samples of students. The data 
would then consist of the performance on each item of the sample of 
students that took the item. This procedure is essentially an [...]
of matrix sampling, which refers to a situation in which samples might be
drawn from two or more domains. For example, [...]
sample items, students, and observation or [...]. Husek and
Sirotnik (1968b) provided a helpful discussion [...]
theoretical basis for the matrix sampling procedure was developed by
[...] press); and Owens and Stufflebeam (1969). The first three studies were
not strict tests of the theory since they were done on existing data. In
[...] which should help reduce the gap between the theoretician and the
practitioner. Efforts such as these are important for application of the
techniques. I believe, however, that there will be little application of techniques
such as matrix sampling or generalizability theory until the procedures 
are presented in a “cookbook” approach. Cronbach et al. (in press) 
presented many worked examples and may have written the type of book 
called for here. 

Osburn (1968) discussed the generalizability theory of Cronbach et 
al. (1963) and concluded that for the theory to hold it would be necessary 
to develop item generating rules. This would permit the definition of an 
item pool that was truly a universe, and sampling from such a pool 
would permit generalization to the universe. He also commented on a 
point made in an earlier article by him (Osburn, 1967) that in a
pre-post-type design with item sampling, the items should not be matched
between the two administrations. Hively, Patterson, and Page (1968) 
reported on a study in which they applied generalizability theory. This 
article provides a good example of item generating rules and the analysis 
procedures used. The work of Fremer and Anastasio (1969) on computer- 
assisted item writing is a promising step toward defining item universes 
from which to sample. 

Although Osburn’s observation is probably accurate, it seems that 
generalizability theory or matrix sampling ideas can be applied to evalua- 
tion without being concerned that the universes be completely defined. 
Osburn is probably correct if the concern is to obtain moments of the item 
distributions across students. In the evaluation setting, however, the 
concern is usually with each item, not with putting scores together over 
sets of items. It would be useful in that sense for the evaluator to build 
large item pools, making them as representative as possible, and administer- 
ing the items with a matrix sampling design by sampling items, students, 
and testing situations. Such a procedure would provide more information 
about what the program is achieving than the typical practice of having 
all students take the same fifty-item test at the same time. Furthermore, 
matrix sampling procedures could be used effectively with existing tests. 
Instead of all students taking one standardized test, would it not be desir- 
able to have a pool of such tests and have different samples take different 
tests? If the concern is to evaluate the program, then it would seem desirable 
to get information on a variety of possible outcomes regardless of whether 
these outcomes are a strict random sample of the universe of outcomes. 
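
The item-by-student sampling suggested here can be sketched as follows. This is an illustration under simple assumptions (each student receives an independent random subset of the pool, and an item is scored 0 or 1); the function names and the response layout are hypothetical and do not come from the studies reviewed.

```python
import random

def matrix_sample(n_students, n_items, items_per_student, seed=0):
    """Assign each student a random subset of the item pool."""
    rng = random.Random(seed)
    return {
        student: sorted(rng.sample(range(n_items), items_per_student))
        for student in range(n_students)
    }

def item_proportion_correct(assignment, responses):
    """Proportion correct per item, using only students who took the item.

    responses maps (student, item) -> 0 or 1 for administered pairs.
    """
    totals = {}
    counts = {}
    for student, items in assignment.items():
        for item in items:
            totals[item] = totals.get(item, 0) + responses[(student, item)]
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in counts}
```

With thirty students each taking five items from a pool of twenty, every student answers only a quarter of the pool, yet the program is described on whichever of the twenty items were drawn.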

The application of matrix sampling in an evaluation study requires 
several design decisions about populations of items or tests, observation 
occasions, and subjects to be sampled. Chadbourn (1969) studied the 
problem of maximizing generalizability under cost constraints and developed
equations for determining the number of observations and number
of tests that would yield the maximum generalizability coefficient given 
certain cost and design limits. The procedures should be helpful in design- 
ing evaluation studies, although most evaluators will need a manual or 
other instructions before they are likely to use the approach. The equations 
require that many values be known or have reliable estimates such as 
costs and population variances and covariances. That these values are not 
readily available to most evaluators will further inhibit the application 
of the procedure. Knapp (1968) identified and studied three crucial ques- 
tions for matrix sampling: 1) How many individuals should be tested? 
2) How many items are to be administered to each individual? and 3) 
Which items are to be given to each individual? He obtained answers to 
these questions by employing balanced incomplete block designs. 
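
A small balanced incomplete block design shows how the third question might be answered. The sketch below is only an illustration, assuming a pool of seven items and the standard cyclic construction from the base block (0, 1, 3); it is not Knapp's own design, and the function names are hypothetical.

```python
from itertools import combinations

def cyclic_bibd(v=7, base=(0, 1, 3)):
    """Build v test forms by shifting a base block modulo v."""
    return [tuple(sorted((b + shift) % v for b in base)) for shift in range(v)]

def pair_counts(blocks):
    """Count how often each pair of items appears on the same form."""
    counts = {}
    for block in blocks:
        for pair in combinations(block, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts
```

In this design each form carries three items, each item appears on three forms, and every pair of items appears together on exactly one form, so item covariances can be estimated although no individual takes all seven items.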

Another development that has implications for evaluation and a 
relationship with matrix sampling is mastery testing or “criterion-referenced 
testing” as it is referred to by Glaser (1963). Descriptions of the national 
assessment program illustrate the relationship. (See Ebel, 1966; Merwin, 
1966; and Tyler, 1966.) Mastery or criterion-referenced tests are used to 
determine an individual’s status on a performance standard. This is in 
contrast with the more commonly used norm-referenced test in which the 
individual’s status is compared with some norm group. 
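
The contrast between the two interpretations can be made concrete with a short sketch. The criterion value, the scores, and the function names below are hypothetical and serve only to illustrate the distinction.

```python
def mastery_rate(scores, criterion):
    """Criterion-referenced: share of students at or above the standard."""
    return sum(score >= criterion for score in scores) / len(scores)

def percentile_rank(score, norm_scores):
    """Norm-referenced: standing of one score within a norm group."""
    below = sum(n < score for n in norm_scores)
    ties = sum(n == score for n in norm_scores)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)
```

The first function describes a group against a fixed performance standard; the second describes an individual against other people, and its value changes whenever the norm group changes.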

Beaird (1969) and Popham and Skager (1968) are working on inte- 
grations of matrix sampling and mastery testing notions. They are attempt- 
ing to identify the many behavioral objectives of educational programs 
and to develop a variety of items or procedures for measuring the attain- 
ment of the objectives. The outcome of these projects should be a large 
pool of items for each of the many objectives. This pool could then be 
a population for sampling items, and the items could be used by evaluators
for determining degree of mastery of the objective. 

Mastery testing is certainly not new, but its connotation of absolutism 
made it dormant for several years, at least among writers on measurement. 
Classroom teachers probably never did stop using it, as witness the
“percentage-correct” method of test [...] that is [...]. Mastery
testing regained respectability with the programed instruction movement
and [...] a useful measurement procedure. (See Gagné,
1962; Glaser, 1963; Bloom, 1968; Horn, 1966, 1968; Ebel, 1965, 1968; 
Popham and Husek, 1969; and Walbesser and Carter, 1968.) Mastery 
testing is certainly relevant to educational evaluation. To be able to 
say that a certain proportion of students mastered or failed to master this 
item may be more relevant than to say how their performance compared
with some norm group. Furthermore, mastery testing combined with matrix 
sampling ideas will provide such information about a large number of items. 

Cronbach (in press) discussed concepts of validity that are relevant
to mastery testing. It appears that the validity of mastery test items is 
primarily a content validity question, i.e., does the item adequately repre- 
sent the content to be mastered. If the content, however, is a construct 
(e.g., problem solving) then one would have a construct validity problem. 
Wardrop (1970) discussed the use of item-characteristic curves in con- 
structing and selecting mastery items. 

An evaluator might often want to use mastery items, but he also 
might want to differentiate among students. These two purposes seem 
somewhat incompatible. Carroll (1963) suggested that “time taken to 
learn” is an appropriate criterion measure for classroom learning research; 
it does seem reasonable that an evaluation study could use mastery items 
for determining what content was taught and use “time taken to learn” 
to obtain measures of individual differences. These studies would necessarily 
have to be quite restricted. 

Other developments in measurement that might be effective are sequen- 
tial testing and confidence weighting, although they have not been studied 
in an evaluation context. Sequential testing, a procedure in which presen- 
tation of a test item is contingent on performance on a previous item, 
might be used in combination with item sampling to build instruments 
with differing sequences of items. Sequential testing will become more 
realistic as computers become available for controlling the testing. Follow- 
ing are references on sequential testing: Linn, Rock and Cleary (1969); 
Cox (1965); Cleary, Linn, and Rock (1968a, 1968b); Cox and Graham 
(1966); and Smith (1968). Lord (1968) studied what he called “tailored 
testing” which appears to be sequential testing. His results suggested that 
such testing with the necessarily large pool of items is little, if any, better 
than using a smaller number of the most discriminating items as a con- 
ventional test. 
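
The contingency that defines sequential testing can be sketched in a few lines. This is a minimal illustration, assuming items ordered from easiest to hardest and a simple halving rule for choosing the next item; it is not the procedure of any study cited above, and the names are hypothetical.

```python
def sequential_test(difficulties, answers_correctly, max_items):
    """Present items one at a time, choosing each from the last response.

    difficulties: item difficulties sorted from easiest to hardest.
    answers_correctly: function taking a difficulty, returning True/False.
    Returns the (difficulty, correct) pairs administered and the index
    separating the items judged passable from those judged too hard.
    """
    lo, hi = 0, len(difficulties) - 1
    administered = []
    while lo <= hi and len(administered) < max_items:
        mid = (lo + hi) // 2
        correct = answers_correctly(difficulties[mid])
        administered.append((difficulties[mid], correct))
        if correct:
            lo = mid + 1   # move to harder items
        else:
            hi = mid - 1   # move to easier items
    return administered, lo
```

An examinee who can answer every item easier than difficulty 10 in a fifteen-item pool is located after four items rather than fifteen, which is the economy the procedure aims for.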

Confidence weighting, i.e., having the student indicate the degree of
confidence held for the answer, might be introduced with mastery testing 
to obtain a more precise estimate of the degree of mastery. See Rippey 
(1968); Michael (1968); Ebel (1965); Coombs, Milholland, and Womer
(1956); Shuford, Albert, and Massengill (1966); Traub (1969); Archer 
(1969); Massengill (1969); and Shuford (1969) for discussions of confidence
weighting and related topics.
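
One way to score confidence-weighted responses is sketched below. The text above does not specify a scoring rule; the quadratic rule used here is an assumption chosen for illustration, and the function name is hypothetical. The student spreads a total probability of 1 over the answer options.

```python
def quadratic_score(probabilities, correct_index):
    """Quadratic scoring of a confidence-weighted multiple-choice response.

    probabilities: the student's stated confidence for each option,
    summing to 1. Returns a score between -1 and 1.
    """
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("confidences must sum to 1")
    return 2.0 * probabilities[correct_index] - sum(p * p for p in probabilities)
```

Full confidence in the right answer earns 1, an even spread over four options earns 0.25, and full confidence in a wrong answer earns -1, so the score separates partial knowledge from both mastery and confident error.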

A primary concern in measurement has always been to define what
is to be measured. Evaluators of educational programs may have defined 
their object as programs but they have generally measured students. Pro- 
cedures are available for measuring most program outcomes, and these 
procedures should be used at the expense of emphasis on procedures that 
are appropriate for measuring individuals.

Measurement Strategies


The kinds of evaluation effort needed in education are complex and 
require accommodation of a large number of variables. Technology for 
handling the data once they are obtained is available, but obtaining the 
data presents a serious logistics problem. The data gathering must be 
planned as a systematic effort. If not planned, the data will probably 
be a haphazard assortment of unrelatable numbers. 


Several systems are available to assist in planning the evaluation 
effort. Actually each of the evaluation models (reviewed elsewhere in this 
issue) is an evaluation plan. Scriven’s (1967) differentiation between 
formative and summative evaluation is a helpful way of planning evaluation.
The article by Metfessel and Michael (1967) can be used to identify
the many outcome variables and measurement procedures. A publication
by the New England Educational Assessment Project (1966) could be
helpful for an evaluator in planning the study as well as for identifying
measures. The Program Evaluation and Review Technique (PERT) is a
technique that was developed for evaluation purposes in the industrial
context. Cook (1966) indicated the educational application of PERT and
has argued for its relevance to educational evaluation. Grobman (1968) 
identified many of the issues and concerns facing the evaluator and 
provided some valuable suggestions for planning evaluations. Grobman’s 
experience with the evaluation activities for the Biological Sciences Cur- 
riculum Study and other projects has provided her with insights on tactics 
and strategy that should be helpful for an evaluator. 

A matter that should be considered in planning some evaluation is 
the possibility of aptitude-treatment interactions. If such are possible, then 
the evaluation design should be able to identify them. Cahen (1969) dis- 
cussed methodological considerations for identifying aptitude-treatment 
interactions. The bibliography of the paper contains additional references 
on the topic. 

With all of the plans and models available, the evaluator still has 
many decisions to make on what data to gather. Glass (1969) offered
a methodology for establishing priorities on data collection. He suggested 
that the priority would depend on, “1) the costs of gathering different 
data, 2) estimates of the prior probabilities that each alternative embodied 
in the decision will be supported by the data—if they were to be gathered, 
and 3) the costs of implementing each of the alternatives of the decision.”
His explication of these points is valuable reading for the evaluator. 



REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 2 


Summary 


Educational evaluation is broadening in scope and is truly becoming 
an activity concerned with value. Asher (1969, p. 11) in discussing the
comprehensive nature of evaluation wrote: 


Evaluation, then, is more than and broader than research design, 
measurement theory, and statistical analysis, but ultimately com- 
pletely dependent on them. Evaluation completely encompasses 
these methodologies and, if these methodological components 
are weak, the evaluation which builds upon them can be no 
stronger. Just as indices of validity can be no greater than the 
reliabilities of the component measures, evaluation can be no 
better than the internal validity of the information presented to it. 


The increased comprehensiveness of evaluation efforts and a recognition
of what is being evaluated have required an expansion of the number
and type of measurements included in the evaluation. Fortunately
there have been many developments in the measurement field that are 
relevant to the requirements of educational evaluation. Observation systems, 
interaction analysis, matrix sampling, generalizability theory, computer- 
controlled testing, and mastery testing are all relatively new developments 
that have important potential for improvement of evaluation. As in all 
areas of education, however, much work is needed for an adequate tech- 
nology to be available. 


What are the needs? The following are important: 


(1) Prototypical studies and manuals for administering and inter- 
preting studies using matrix sampling and generalizability theory are 
needed. Most of the work in these areas has been at the theory level, 
and even the few prototypical studies seem to be written for a specialized 
audience. Little application of the procedure will occur at classroom, school, 
and district level until the concepts and methods are translated for the 
evaluator working at such levels. Furthermore, studies are needed that
indicate the kinds and magnitude of error that are likely to occur when 
the assumptions of the techniques are violated. 


(2) The broadened concept of educational evaluation requires meas- 
urement and classification procedures for variables that are not psychologi- 
cal. Consequently much work is needed in adapting the methods of the 
social and behavioral sciences to educational evaluation. The models of 
anthropology, economics, political sciences, sociology, law, and journalism 
need to be studied for their applicability to evaluation. Some of this analysis 
is occurring, e.g., cost-benefit analysis. It is clear, however, that the models
are not immediately applicable to educational evaluation because many
variables that are used in the basic discipline are not present or must 
be measured in another way in education. The problem of adapting such 
models to educational situations creates measurement problems of its own. 


(3) The comprehensiveness of evaluation requires many decisions by 
the evaluator. Study is needed on viable procedures for decisions on what 
to measure and how to handle the mass of data. 

Evaluation is emerging as a field in its own right; hopefully, evalua- 
tion will contribute to substantive change in education as well as to meth- 
odological advances in the study of education. 

EDITORIAL


The first volume 
of the Review of Educational Research which was published in 1931 con- 
tained a complete listing of all 329 members of the American Educational 
Research Association! AERA has grown to nearly 10,000 members, and 
educational research has grown with it. As the organization and the dis- 
cipline have changed, AERA’s publication program has changed. In 1964, 
the American Educational Research Journal was created as an outlet for 
original contributions to educational research. In April 1969, the Editorial 
Board of the Review of Educational Research proposed to the Association 
Council that the nearly forty-year old Review be reorganized so it might 
better serve the purposes of the Association’s publications program. 

Since 1931, the Review has published solicited review manuscripts 
organized around a topic for each issue. Generally, a slate of fifteen topics 
was chosen by the Editorial Board and each topic was reviewed once each 
three years in one of the five issues per volume. An issue chairman was
selected by the Editor and given the authority to choose chapters and 
authors for the issue. The Review has served well for many years both the 
discipline of educational research and the profession of education. As an 
organization and as a profession, we are grateful for the generosity and 
efforts of the more than 1000 scholars who have contributed to the Review. 

The purpose of the Review has always been the publication of critical, 
integrative reviews of published educational research. In the opinion of the 
Editorial Board, this goal can now best be achieved by pursuing a policy of 
publishing unsolicited reviews of research on topics of the contributor’s 
choosing. In reorganizing the Review, AERA is not turning away from the 
task of periodically reviewing published research on a set of broad topics.
The role played by the Review in the past will be assumed by an Annual
Review of Educational Research, which AERA is currently planning. The 
reorganization of the Review of Educational Research is an acknowledg- 
ment of a need for an outlet for reviews of research that are initiated by 
individual researchers and shaped by the rapidly evolving interests of 
these scholars.

[...] education is not increasing nearly as fast as is
[...] of the educational research literature is
obvious. A body of literature can grow faster than a body of knowledge
when it swells with false knowledge, inconclusive or contradictory findings,
[...] knowledge is no less important than the discovery of new knowledge.
It is hoped that the new editorial policy of the Review, with its implicit 
invitation to all scholars, will contribute to the improvement and growth 
of disciplined inquiry on education.


The third number of the fortieth volume of the Review is the first issue 
to be published under the new editorial policy.


Gene V Glass 
Editor 

THE CONCEPT OF
MATHEMAGENIC ACTIVITIES¹


Ernst Z. Rothkopf 
Bell Telephone Laboratories, Incorporated 


Psychologists write from time to time in 
human language. Some years ago, I submitted the report of an experiment 
about mathemagenic behavior to a journal. The article started with the 
sentence, “You can lead a horse to water but the only water that gets into 
his stomach is what he drinks.” The editor, probably judging this to be too 
alimentary, deleted the sentence. I regretted this not only because the little 
phrase pleased me but also because the problem of the not-drinking horse 
was and is a useful metaphor for explaining why the study of mathemagenic 
activities is a challenging enterprise for the educational psychologist. 


The proposition is simple. In most instructional situations, what is 
learned depends largely on the activities of the student. It therefore behooves 
those interested in the scientific study of instruction to examine these 
learning activities, i.e., the “drinking habits” of students. 


The singular importance of certain learner activities first impressed 
me in connection with a theoretical analysis of frame formats in pro- 
gramed instruction (Rothkopf, 1963b). Student responses and the immediate 
feedback of knowledge of results had been interpreted in that context as 
having a direct effect on the acquisition of subject matter knowledge. 
Analysis led to the rejection of this interpretation and to the belief that 
these operations affect the inspection activities of the students instead. The 
inspection activities then determine what is learned. 

A similar conclusion was reached in [...]-like
phenomena in earlier experiments on learning from written sentences
(Rothkopf, 1962, 1963a; Rothkopf & Coke, 1963, 1966). This prompted me
to coin the word mathemagenic to refer to attending phenomena. It is 
derived from the Greek root mathemain—that which is learned and 
gignesthai—to be born. Mathemagenic behaviors are behaviors that give 
birth to learning. More specifically, the study of mathemagenic activities
is the study of the student’s actions that are relevant to the achievement of
specified instructional objectives.


1. This paper is a somewhat expanded version of a talk delivered at the 1969 meetings 
of the American Educational Research Association, Los Angeles. 

The concept of mathemagenic activity implies that the learner’s actions
play an important role in determining what is learned. This concept is 
closely related to the distinction between nominal and effective stimuli in 
learning. The distinction has been most coherently stated in recent years 
by Underwood (1963), although the history of this idea can be traced at 
least as far back as Gustav Fechner (1860). Basically the distinction is 
that the nominal stimulus—the stimulus object presented by the experi- 
menter or teacher—is not in simple correspondence to the stimulus that 
has an effect on the subject. Discrepancies result from characteristics of 
the receptor surfaces and from acts by the subject or student which 
transform or elaborate the nominal stimulus. These acts have been called 
set, attention, orienting reflex, information processing, cognition, rehearsal, 
and so on. All of these acts fall within the broad boundaries of the term 
mathemagenic activity, as it is used here. These considerations suggest a 
functional view of mathemagenic activities, namely that they determine the 
nature of the effective stimuli in experimental or instructional situations. 
The character of the effective stimuli, in turn, determines what is learned. 


Definition 


Consider how students learn from written material. What determines
[...] capabilities a student has acquired after exposure to an instructional
[...] of the instructional material is undoubtedly
[...] degree, its organization. But most important, to
[...] is what the student does with the instructional
[...] the student has complete veto power over learning,
[...] without some activity on his part the instructional objectives can never
be achieved.


[...] enter the instructional situation. He must procure
[...] and place it before him. He must then proceed
to translate the [...] of printed letters into internal representations and
[...] ways. Although none of these actions is
sufficient, all are [...]. The manner in which these acts are
[...] what is learned. A serious interest in the effective
[...] instructional process requires a serious attempt to
describe and to understand those actions on the part of the student that
[...] the attainment of instructional objectives. That is the chief
purpose of the study of mathemagenic behaviors.


[...] human activities may be defined as the
[...] the achievement of specified instructional
objectives. For any [...] instructional objectives, activities in specific
[...] may fall into four categories: (1) mathemagenic positive,
(2) mathemagenic negative, (3) mathemagenic neutral, and (4) mathemagenic
unknown. Mathemagenic positive means that the activities are
conducive to the attainment of the specified instructional objectives;
mathemagenic negative means that they interfere.


For convenience in writing, only activities in category 1 or 2 will be
referred to as mathemagenic activities. This results in a slightly expanded 
version of the working definition described earlier, namely that 
mathemagenic activities are those student activities that are relevant to the 
achievement of specified instructional objectives in specified situations or 
places. 

The importance of the terms specified instructional objectives and 
specified situations should be emphasized. Reference to specified instruc- 
tional objectives is important for logical as well as for practical reasons. 


Those in learning and in instructional research are beyond that stage 
of innocence where they can speak of learning as if it were some simple 
deposit with only one dimension, namely quantity. Exposure to a set of 
instructional stimuli results in many learned consequences which are 
reflected in later performance. These learned consequences can not be 
fully enumerated for any given instructional episode because it is not
known how it can be done. One of the key difficulties is that the main
measuring operations, i.e., testing, profoundly affect what researchers want
to measure later (Rothkopf, 1969).

Because the learned consequences of an instructional episode are varied 
and difficult to determine, any definition of mathemagenic activity that is 
broad enough to encompass all activities that produce any learning (or
performance changes) in any situation is too broad to be useful. It is more
practicable to speak of mathemagenic activities that are relevant to a
restricted set of performance objectives or to classes of such objectives.


The distinction between performance objectives and classes of such
objectives must for now remain somewhat hazy. An example of an objective
is that a student “be able to describe in specified detail the setting in which
[...].” An example of a class of objectives to which a single
mathemagenic activity pattern may be relevant is to describe the natural
settings in which to find each of the mushrooms that are mentioned in a
particular book on mycology.
The reason that the definition of mathemagenic activity requires the
specification of situations or places has a similar basis. [...]
in different situations may depend on different actions by the student. It is
quite clear that it is useful to speak of fairly broad classes [...]
places. Two different specifications of situations should be [...]:
(a) instructional settings and (b) specific characterizations of [...]
materials. Examples of the first might be “short written [...]
carrels” or “motion pictures in classroom groups.” Different patterns of
mathemagenic activities may be expected to support the achievement of the
same instructional objectives in these two settings.


For certain instructional objectives in the second characterization, it 
is feasible and useful to describe the referent instructional situation in 
considerable detail. For example, the specification of the instructional 
situation may consist of an effort to describe the organization of the 
instructional material. This has proven to be particularly useful in the 
analysis of the mathemagenics of written instructional materials (Frase, 
1969; Frase and Silbiger, submitted for publication). 


In principle, the character of mathemagenic activities with regard to 
each instructional objective has to be discovered for each separate 
instructional situation or place. However, there are general classes of 
situations or places and classes of objectives for which a general description 
of mathemagenic activities may be made. An example of such a general 
class of situations is that of acquiring from written instructional materials 
the ability to verbally describe scientific principles. 


Forms of Mathemagenic Activities 


For the use of written instructional material, several different classes of 
activity with mathemagenic significance may be distinguished: 


Class I. Orientation: Getting Ss into the vicinity of instructional
objects and keeping them there for suitable time periods. In certain institutional
settings, this also involves control over activities that may distract or
disturb other students.


Class II. Object Acquisition: Selecting and procuring appropriate
instructional objects. Maintenance of selection and procurement activities.


Class III. Translation and Processing: Scanning and systematic
[...] on the instructional object; translation into internal speech
[...] representations; the mental accompaniments of reading:
discrimination, segmentation, processing, etc.


The activities subsumed in Classes I and II are of a [...],
are directly observable, and are relatively easy to [...]. Class III
activities include only a few components that are directly observable,
[...] those involving the extraocular muscles and the musculature of
the throat and buccal cavity. Measurement of these activities [...] a
certain degree of technical sophistication. The remaining mathemagenic
[...] Class III have the nature of hypothetical constructs. They are
[...] useful in the prediction of interesting outcomes and [...]
thinking about [...], but they are inferred indirectly from observations.
The theoretical inventions in Class III are designated mathemagenic
[...] because they are (a) by attribution activities of S and (b) because
they are theoretically designated to affect substantive learning relative to
objectives.


Discovery of the Mathemagenic Relevance of Activities 


The discovery of the mathemagenic relevance of activities is in principle 
simple although the task may actually be quite complicated. The principle 
is really a definition. Activities of Classes I and II are mathemagenic if it 
can be shown by the usually accepted methods of experimental inference 
that they are relevant to the attainment of the performance specified by an 
instructional objective. 


For the indirectly observable components of Class III activities, matters
are more complicated. When it is discovered that certain treatments [...]
without being themselves instructive, a hypothesis about the effect of the
treatment on mathemagenic activity becomes attractive. Class III activities
are hypothesized when it is clear that the instructional materials were at 
least within the perceptual range of S. Or to say this in a complicated but 
more exact way, Class III activities are hypothesized under the conditions 
stated earlier, if Class I and II activities have been observed to have 
occurred in instructionally adequate patterns. 


The Nature of Category III Mathemagenic Activities 


Class III refers to what is commonly called reading. Reading is a
multilevel process. It seems useful to distinguish among three classes of
actions: (1) translation, (2) segmenting, and (3) processing. The actions,
from translation onward, represent progressively increasing independence
from direct stimulus control. This is meant in the sense that the
hypothesized actions at the processing level are less directly mappable on
nominal stimuli than actions at the translation level. Actions at these
three levels are also distinguished by progressive decreases, from
translation to processing, in accompanying observable muscular activities.
For convenience, the progression of action from translation to processing
will be described as “deeper.” This is simply a summary description and
refers to progressive differences among the three levels in stimulus
control and observable muscular activities.

All three levels of action in reading have memorial consequences.
Memorial consequences become more complex and enduring as the depth
of the actions increases.

In learning from written material, actions at the three levels are
thoroughly entwined. Because what one can observe is so limited, it would
be difficult to disentangle the activities at various levels on any given
occasion. The various transactions probably differ considerably among
persons who were trained differently and for any given person from one 
time to another. The conception of different levels of mathemagenic activ- 
ities is interesting because of the theoretical conjecture that activity at any 
given level may deteriorate until mathemagenic effects are nil or become 
negative. It is hypothesized that deterioration is always from the deep 
central level (processing) towards the peripheral (translation). 


Translation 


The first input for learning is created in the translation stage. This
stage involves visual fixations on the written stimulus and some degree of
vocalization. The muscles involved in translation include the extraocular
muscles and the musculature of the upper respiratory tract. The hypothetical
component includes semantic decoding and the creation of internal
representations. The pattern of eye fixations varies widely in duration,
location, sequence, and spacing between fixations. Vocalizations vary from
loud shouting (as in reading of roll calls to a large number of people
outdoors) to those that are only detectable by electromyographic techniques
and have no ordinarily audible consequences.


Hypothetical components of Class III activities can, in part, be indexed
by the observable activities of the translation stage. Translation has at
least two identifiable states that differ in the frequency and regularity (and
perhaps duration) of eye movement and fixations. One state is characterized
by infrequent and irregularly spaced fixations that identify scanning or
searching. The other state is identified by more frequent and regularly
spaced fixations associated with the acquisition of systematic skills from
the text. Both states can be conceived of as being classes of sampling.
Inferences about the translation processes may become possible by mapping
eye movements on text. This approach is at present hampered by technical
measurement problems.


Segmentation 


Segmentation is the motor aspect of reading activity that serves to break
the string of vocal and subvocal articulations produced by the translating
activities. It may be conjectured that segmenting involves some of the
musculature of respiration. The most likely candidate for a class of
activities that accompanies segmentation is intonation and inflection in
reading aloud.

Functionally, the segmentation stage of reading establishes connections
within the string of utterances or near-utterances that are produced in
the translation stage. The activities of this stage result in the “belonging”
effect that Thorndike (1922, pp. 64-73) described. The principle may be
illustrated by the simplified example experiment described by Woodworth 
and Schlosberg (1954, p. 711). The sentences “John Smith is a psychologist. 
Henry Jones is an astronomer. Walter Hodges is a biologist” are presented 
to Ss. Following this, they are asked “Who is the psychologist?” and the 
majority answers “John Smith.” This occurs despite the fact that John 
Smith is more remote from the word psychologist in terms of number of 
interposed words than Henry Jones. Segmentation describes a class of 
activities that results in linkage among terms although this linkage may 
not always be appropriate with respect to instructional goals. These activities 
are probably analogous or related to the prosodic or rhythmic patterns 
used in reading aloud. 

There are at least two ways of conceptualizing the hypothesized 
segmenting activity in reading. One of these is that “interpretation” in some 
fairly deep sense precedes segmentation (or intonation). This view would 
be in keeping with the current Zeitgeist in psycholinguistic enclaves of 
experimental psychology, but I do not believe it is correct. The second 
conception is that segmenting is under relatively peripheral control and 
that segmentation provides the perceptual “priors” for various interpretation 
hypotheses. 

The possibility exists that the translation and segmenting system are 
mutually interactive. Buswell’s observation (1920) that the eye-voice span 
grows shorter near the end of sentences suggests such interaction. However,
the interpretation of Buswell’s result is not clear because of the well- 
known sharp drop in constraint across sentence boundaries. 


Processing 


Processing is the third, deepest stage of the general reading process.
It is the stage about which least is known. Other students of reading such
as Thorndike (1917) have also speculated about its role. Processing is
appealing as a theoretical locus where cross-sentence interpretation can take
place and where review processes are initiated.


Control of Mathemagenic Activities


Classes I and II. During the last few years progress has been made in
identifying some of the procedures for controlling Class I and
Class II mathemagenic activities. The research of Hall, Lund, and Jackson
(1968), Hamblin and Buckholdt (undated), O'Leary (1968), Sail ahi
(1968), and Walker and Buckley (1968) are examples of this work. It has
been shown by these investigators in school situations with individual study
settings that children’s activities such as moving, sitting, attending, noise
making, temper tantrums, etc., can be shaped, strengthened, and
extinguished. A variety of instrumental conditioning procedures such as
token reward systems have been used for this purpose. It appears feasible
to make these control procedures work in the usual school settings. It is
therefore a little surprising that the so-called “centers of educational
innovations” have not given more attention to this work.


Class III. Directly Observable Components. Carmichael and
Dearborn (1947), in their pioneering study of prolonged reading, observed 
that the incidence of regressive eye movements and other eye movement 
patterns that indicate decreased reading efficiency during prolonged reading 
(4-6 hours) can be delayed by use of adjunct questions. Schroeder and 
Holland (1968) reported that eye movement patterns in search can be 
modified by consequences and are probably under instrumental control. 
Some interesting indications of respondent enhancement and suppression 
of electromyographic potentials from the throat and face during reading 
have been reported by McGuigan and his co-workers (McGuigan and Rodier,


Hypothetical Components. Very substantial evidence is available 
that the deeper mathemagenic activities in Class III can be modified by 
directions and by the use of adjunct questions. The directions used in this 
experimental work have been of two kinds: (a) directions of intent and (b)
manipulative directions, particularly directions to search. 


Postman and Senders (1946), in a provocative pioneering study, found
that directions to learn specific classes of information from text may 
facilitate learning. However, it was shown by these investigators that 
facilitation was not necessarily in keeping with the intent of the directions. 


The results of three studies (Bruning, 1968; Rothkopf, 1966; Tenenberg,
1969) indicate that vague hortatory directions of intent affect Ss’
mathemagenic activities sufficiently to elevate postreading test
performance. Frase, in later experiments (Frase, 1969; Frase and Silbiger,
in press), showed that directions to find certain items of information in a
text produce predictable patterns of incidental learning. He made these
predictions on the basis of inferences about presumed mathemagenic
activities of Class III. These inferences were derived jointly from the
organization of the text and the nature of the search directions.

Several studies are now available (Bruning, 1968; Frase, 1967;
Rothkopf and Bisbicos, 1967) that show that adjunct questions
administered shortly after inspecting the text segment to which they
relate affect mathemagenic activities. In the studies referred to,
this change reflected itself in post-training test performance that was
superior to the performance of (a) no-question control groups, and (b)
groups who saw questions prior to inspecting the text segment to which
the questions were relevant.


The experimental results on adjunct questions have not always been
correctly understood. The first major misunderstanding has to do with 
the experimental technique, the second with the interpretations of the 
findings. Basically, the adjunct question experiments at the Bell Telephone 
Laboratories were incidental learning studies. At the simplest level they 
involved measurement of performance changes on reading material that 
was not directly related to the text components on which the experimental 
questions were based. Hence, any measured behavioral changes could not 
be attributed to the direct instructive effects of the questions’ content. The
changes were ascribed instead to changes in presumed mathemagenic 
activity during inspection of the text. These activities, though not directly 
observable, were measurable by their effect on specified objectives as 
sampled by the criterion examination and are, therefore, of a positive 
mathemagenic character. 


Several investigators performed experiments in which the criterion test 
was derived from exactly the same material as the text question or in which 
the absence of transfer from the content underlying the experimental 
questions to the content underlying the criterion test was not experimentally 
established. It is difficult to interpret results from experiments such as these 
because it can not be determined to what extent criterion test performance 
reflects changes in mathemagenic activities or the direct instructive effects
of questions. 


The second misunderstanding concerns findings. Nothing that has 
been reported from the Bell Telephone Laboratories experiments would 
support the conclusion that adjunct questions are always mathemagenically 
positive. Rothkopf and Coke (1963, 1966) found that test-like events 
occurring during training can have negative effects on performance. The 
Rothkopf (1966) and Rothkopf and Bisbicos (1967) studies simply 
provided additional data to the existential proof that test-like events of a 
certain character can result in activities that have positive mathemagenic 
effect. But I believe that this is essentially a secondary finding. In my 
opinion the most interesting experimental result in the adjunct question 
experiments was that mathemagenic activities are adaptive. This means that 
mathemagenic activities can be altered by their consequences and that,
therefore, the shaping of mathemagenic activities in an instructional fashion
by environmental events (or contingencies) is a practical possibility. It 
should be noted here that this is not tantamount to the assertion that all 
mathemagenic activities are under instrumental control. In fact, the 
direction of experiments cited earlier (Bruning, 1968; Frase, 1969; Frase 
and Silbiger, submitted for publication; Rothkopf, 1966; Tenenberg, 1969) 
suggests that mathemagenic activities sometimes have a respondent
character.


Status of Mathemagenic Activity as a Scientific Concept


Efforts toward early theoretical formalizations about mathemagenic 
activities would probably be both premature and somewhat presumptuous. 
The present paper and its related predecessors (Rothkopf, 1963b, 1965, 
1968, 1969) are not discussions of theoretical conceptions; they are mainly 
delineations of a class of scientific and practical questions that are probably
related.


Questions that seem to be of great practical and scientific consequence 
include the following: What are the more critical student activities that 
affect what is learned in the more important instructional settings? What 
are the factors that modify mathemagenic activities? What causes positive 
mathemagenic activities to deteriorate? How can one give a coherent 
account of Class III activities that is consistent with what is known about 
visual perception, memory and language processes? 


Practical Consequences 


Other than raising a number of scientific questions, the concept of 
mathemagenic activities suggests a general strategy for approaching the 
scientific management of instruction. 


The concept of mathemagenic activities tends to shift emphasis from 
investment of resources in the development of instructional materials to 
investment in the instructional environment. This is to a certain degree 
contrary to the current Zeitgeist which has been characterized by very 
careful design of instructional materials. 


Efficient designs have been sought through applications of man’s 
embryonic understanding of human learning processes. This approach, 
which I have also called “the calculus of practice” (Rothkopf, 1968), has
received widespread vocal support but has not, by and large, been very
successful. The other approach has been through the experimental
validation of documentary instructional materials, i.e., systematic tryouts and
revisions. Although this method has not been faulted in principle, it has 
not gained wide acceptance, probably because of economic factors in 
instructional publishing and also, to some degree, because of problems that 
arise in the choice of instructional objectives. When this method has been 
used in a thorough and conscientious manner, it has generally proven to 
be successful but very expensive. 


The concept of mathemagenic activities suggests another approach. 
Instructional materials are accepted within some limits as givens. Emphasis 
in instruction is on promoting those activities in the student that will 
allow him to achieve instructional goals with available materials. This 
is truly a student-centered approach. How to manage this poses many 
practical problems, but it may be more economical than expensive
concentration on the detailed design of instructional material.


BOUNDARY CONDITIONS FOR
MATHEMAGENIC BEHAVIORS¹


Lawrence T. Frase 
Bell Telephone Laboratories 


Herbert Spencer once said that when a man’s 
knowledge is not in order, the more of it he has the greater will be his 
confusion. There is more knowledge about instructional processes available 
today than ever before; and there is more confusion. Educational
psychologists write that educational psychology seems to be “a
superficial, ill-digested, and typically disjointed and watered-down
miscellany of general psychology, learning theory, developmental
psychology, social psychology, psychological measurement, psychology of
adjustment, mental hygiene, client-centered counseling and child-centered
education,” and that it “has produced a tremendous quantity of empirical
research studies, many
of them without thoughtful conceptualizations, without explicit responsi- 
bility for developing theory of instruction, and without contribution to 
knowledge about instruction” (Wittrock, 1967, p. 1). As Spencer warned, 
intellectual disorder (in educational psychology) has resulted in confusion. 
As suggested in the quotes above, this disorder stems in part from two 
afflictions that have plagued the research community. 


The first and most pernicious affliction has been the failure to set 
research priorities—to isolate problems of general importance and to 
persevere in an attack on them. The consequence has been the accumulation 
of fragmented knowledge about a variety of relatively unimportant
instructional topics. It is encouraging to note the emergence of one especially
important research topic in the writings of Ausubel (1963), who emphasized
the study of meaningful verbal learning; Rothkopf (1965), who made some
important conjectures about learning from written materials; and Carroll
(1968, p. 1), who stressed the study of learning from being told. This
emergent topic is the study of how people learn from ordinary text.

The second affliction is that researchers have neglected to relate data 
to theoretical conceptions. It is difficult to envision a coherent body of 
knowledge about instructional processes without an evolving conceptual
scheme within which to view this knowledge. The vague outlines of such
a conceptual scaffold are implicit in the definition of mathemagenic
behaviors, i.e., behaviors which give birth to learning. To understand what
behaviors produce learning is, in part, to describe the learning outcomes
that result from performing various operations upon written materials and
the conditions which determine those operations. To describe these
conditions is to conduct related studies that clarify critical variables and
therefore sharpen original conceptions. Such research can be useful if it
leads to analyses of what is meant by obscure concepts such as discovery or
intentional learning, in terms of the specific responses which those concepts
entail. The ability to control those responses in the context of conventional
written materials can aid in the development of flexible instructional
techniques.


1. This paper was presented at the Annual Meeting of the American Educational
Research Association, Los Angeles, 1969.


In this paper I describe research related directly to the mathemagenic 
approach. Much of this research has been concerned with how test-like 
events affect learning. In this paper, the motivational level of the learner, 
characteristics of text, and several variables related to test-like events are 
seen as jointly determining learning outcomes. 


Boundary Conditions for Mathemagenic Behaviors 
Orienting Directions 


An orienting direction is a verbal stimulus which disposes the reader 
to respond to certain aspects of a text. General instructions to learn (Roth- 
kopf, 1965), “advance organizers” (Ausubel, 1963), and various cueing 
strategies (Anderson & Faust, 1967) are classes of orienting directions. The 
general label “orienting directions” is intended to emphasize that questions 
are only a sub-class of events that might be used to insure the occurrence 
of appropriate learning behaviors. 


Three characteristics of questions influence learning: (a) their position
in a text, (b) the contiguity of questions and related content, and (c) the 
type of question. 


Position of questions in text. An important characteristic of questions is 
their position relative to the related content. A simple change in position can 
radically transform consequent reading behaviors. Rothkopf (1965) reported 
a study in which he inserted questions in ordinary text either before or 
after the material to which they related. His experiment was especially 
revealing because he looked at how much readers learn from the text to 
which the adjunct questions relate (the relevant information), and how 
much they learn from the text which is not related to those questions (the 
incidental information). In general, he found that Ss learn most when the 
questions come after the material to which they relate. This finding was
replicated, with different materials and Ss, at least six times (see Frase,
1968a, for a review of some of this research). In these replication studies, Ss 
were not permitted to review the text, and in most of them knowledge of 
results was not given with the adjunct questions. 


These studies have shown that questioned groups, in comparison to a 
control group which did not see adjunct questions, retained more of the 
question-related material. Post-question groups retained somewhat more
incidental information, but the pre-question groups retained relatively 
little incidental information. In some cases (Frase, Patrick & Schumer, 1970) 
pre-questions depressed incidental learning well below control group scores. 
Evidently pre-questions can limit the range of stimuli to which effective 
learning responses are made. The question-related text stimuli cue
attentional responses.


Bruning (1968) confirmed that post-questions, in comparison to review
statements, have an effect on learning which is additive (in the statistical
sense) to a review effect. Thus, the same set of questions might function as
review, as confirmation for mathemagenic behaviors, and to select specific
text cues which activate or inhibit certain behaviors.


Contiguity of questions and related content. If pre-questions interact with
the text to restrict the stimuli that are responded to, then placing questions 
close to the related material should increase their selective effects. In one 
study (Frase, 1968b), the problem of contiguity was explored by placing 
questions either before or after every 10, 20, 40, or 50 sentences taken from 
a textbook in introductory psychology. The same full set of questions was 
given to all Ss, but the questions occurred in groups of 1, 2, 4, or 5, depend- 
ing upon their spacing within the text. 
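
The spacing design just described can be sketched in a few lines of code. This is only an illustration, not the procedure or materials of the original study: the function name interleave_questions, the 200-sentence text, and the 20-question set are hypothetical, and the sketch assumes one question per 10 sentences, so that questions fall in groups of 1, 2, 4, or 5 as the spacing grows from 10 to 50 sentences.

```python
from typing import List


def interleave_questions(sentences: List[str], questions: List[str],
                         spacing: int, position: str) -> List[str]:
    """Place question groups before ("pre") or after ("post") each block
    of `spacing` sentences.

    Assumption (not from the source): one question per 10 sentences, so a
    block of `spacing` sentences carries spacing // 10 questions.
    """
    if position not in ("pre", "post"):
        raise ValueError("position must be 'pre' or 'post'")
    per_block = spacing // 10
    layout: List[str] = []
    for block_index, start in enumerate(range(0, len(sentences), spacing)):
        block = sentences[start:start + spacing]
        group = questions[block_index * per_block:(block_index + 1) * per_block]
        # Pre-questions precede the related text; post-questions follow it.
        layout.extend(group + block if position == "pre" else block + group)
    return layout


# Hypothetical materials: 200 sentences and 20 questions.
sentences = [f"s{i}" for i in range(200)]
questions = [f"q{i}" for i in range(20)]
for spacing in (10, 20, 40, 50):
    layout = interleave_questions(sentences, questions, spacing, "post")
    print(spacing, spacing // 10, len(layout))
```

Every condition built this way contains the same full set of questions; only the distance between a question and its related sentences, and the side of the text on which it falls, are varied.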

A multiple-choice test, given after Ss read the material, showed that
retention of the incidental material decreased substantially for the
pre-questioned groups when the questions occurred every 10 sentences, but that
the post-question groups showed an increase in incidental and relevant
retention with questions every 10 sentences. There were no differences
in retention between the pre- and post-question groups when the questions
occurred every 50 sentences; but when a question occurred after every 10
sentences, the post-question groups scored about 40% higher on over-all
retention than the pre-question groups. Therefore, contiguity of the
questions and their related content might work either for or against over-all
retention, depending upon the point at which the reader views a question,
i.e., immediately before or after he reads the text.

Patrick (1968), using the same materials, theorized that if Ss rehearsed
pre-questions, then the questions would be available from memory for long
periods of time. Even if the pre-questions were placed every 50 sentences,
they should still inhibit incidental learning, contrary to the previous
experiment, because they would be present in memory to mediate stimulus 
selection. He studied the effect of having Ss in four groups (a) write out 
the questions they encountered while reading, or (b) write out what they 
thought were the correct answers, or (c) look at the question a second time, 
or (d) simply read the questions and text as written. 


Patrick’s prediction of depressed incidental learning was confirmed for 
the pre-question groups which had written out the questions. Rehearsing 
the response or reviewing the question did not significantly depress inci- 
dental learning. In short, a question in memory can mediate selective 
responding as though it were presented immediately before the related 
text. The incidental retention of post-question groups was not altered by 
rehearsal activities. 


In most studies reported in this paper, learning was tested with im- 
mediate retention tests. Patrick (1968) administered multiple-choice tests
immediately after reading and one week later. His data showed that 
retention was lower on the delayed test, but that the effects of the questions 
remained the same. 


Natkin and Stahler (1969) suggested that retention might improve over 
time under high arousal. They varied arousal by pre-exposure to adjunct 
questions. Their hypotheses were that high arousal would depress immediate 
recall and that arousal deteriorates with repeated exposure to questions. 
Therefore, recall for aroused Ss should increase over time, and Ss who have 
not had pre-exposure to adjunct questions should be the most aroused. Their 
materials consisted of two biology passages with immediate and delayed 
completion tests over the incidental information in the second passage. A 
high arousal group saw adjunct questions only on the second passage, while 
three other groups saw questions on both passages, on the first passage only,
or on neither passage. As predicted, only the high arousal Ss showed rem- 
iniscence effects. In general, questions on the second passage improved 
recall. Arousal effects evidently deteriorate rapidly, since the pre-exposure 
conditions in this study involved only 15 minutes of reading on the first 
passage. A long term retention test would be useful in experiments in which 
brief exposure to questions is employed (e.g., Frase, 1968c). 


Types of question. There is little doubt that the type of question posed is an 
important determinant of the behaviors that the reader will exhibit. For 
instance, Rothkopf and Bisbicos (1967) interspersed in a text those questions 
related to either common or technical terms. Their design involved pre- and 
post-question groups. The post-question groups that had seen technical term 
questions showed high recall of other technical terms. Questions, as orienting 
stimuli, might thus define category cues that can determine a broad class of 
stimuli to which the Ss will respond. 

Frase (1968c) explored the effects of pre-questions that vary in the
amount of material to which they relate. Subjects were asked to underline
the amount of information in a text which would be required to answer a
specific pre-question (e.g., “When was Jim born?”), a comparative question
(e.g., “Who is older, Jim or John?”), or a general question (e.g., “When
were the men in the following paragraph born?”). The Ss’ responses
indicated that the number of words perceived to be necessary to answer the
questions increased in the order of specific, comparative, and then general
questions.


In a follow-up experiment (Frase, 1968c) in which other Ss were 
allowed a limited time to learn the text, it was found that immediate 
relevant and incidental retention was indeed a function of the type of pre- 
question; least learning occurred with the general questions and most 
learning with the specific questions. These findings can be taken to mean 
that learning behaviors might also be modified in accordance with the 
amount of information that must be processed. Analysis of the data showed 
that under high information load, learning can become especially selective: 
this selectivity is closely related to specific responses required by the ques- 
tions. The data of Natkin and Stahler (1969) suggest that a long-term test 
might reveal reminiscence effects with general questions. 


The specific effects of questions, verbal instructions, or other orienting
stimuli are a crucial area for further research. Many of the studies reported
in this paper employed only factual questions, which is unfortunate if the
[...] the laboratory and the [...] directed graphs are being explored for
[...] orienting directions, selective behaviors, and text characteristics
(Frase, 1969b). Thus far, it has been possible to predict for some limited
text structures the items which Ss will mention in free recall, how these
items will be combined to produce sentences, and what sentences Ss will
accept as valid deductions from texts, on the basis of an analysis of the
texts and the questions which Ss were asked to solve (Frase, 1969b).


Incentive 


Interesting regularities appeared in the studies in which pre- and
post-questions were used as aids. Casual observations suggest that
the post-questions might work best when motivation is low. Frase, Patrick,
and Schumer (1970) conducted a study in which Ss were told that they
would receive either 0, 3, or 10 cents for each correct response on a
multiple-choice posttest. Adjunct questions occurred in the text at the rate
of one every 10 sentences or five every 50 sentences, and either before or
after the material to which they related. All Ss in these groups saw the same set
of questions. There were also incentive control groups that read the text 
without the adjunct questions. 

There was little difference among the control, post-, and pre-question 
groups at the highest level of incentive. At the moderate level, the pre- 
question groups were substantially below the control and post-question 
groups. At the low incentive level, the post-question groups were substantially 
above the control and pre-question groups. It seems clear that the adjunct 
post-questions were especially effective at the low incentive level, and that 
Ss were able, in part, to overcome the selective effects of pre-questions under
high motivation. 

Reading time and the amount learned increased substantially with an 
increase in incentive, but incentive interacted in an important way with the 
frequency of the pre-questions. Frequent pre-questions seemed to override 
the effects of incentive on relevant retention, or conversely, Ss only made use 
of the infrequent pre-questions if they were highly motivated. The effect 
of post-questions on relevant learning seemed little influenced by the 
incentive factor, whether questions were frequent or infrequent. 


Two points are worth making about this rather complex set of data. 
First, there seems to be an optimal distance between the questions and 
related text for maintaining the selective and facilitative effects of questions 
across a range of motivational conditions. Second, Ss may have important 
learning skills in their repertoires, but they do not make full use of them 
unless they are given proper incentive. 

The finding that poorly motivated Ss score low with pre-questions 
corresponds to results obtained by Faust (1967). He had college Ss learn a 
programed Russian vocabulary lesson. There were three motivation groups 
in his study: 1) a high group that was told that performance on the task 
related to academic and professional success, 2) a neutral group that did not 
receive motivational instructions, and 3) a low group that engaged in a 
boring task before the programed lesson. One-half of the Ss read a program 
in which the correct response to each constructed response frame was 
underlined in the stimulus portion of the frame. The other half of the Ss 
read the same program without the underlining. Faust reasoned that poorly 
motivated Ss might expend little effort and, hence, attend to only the 
underlined stimuli. Consequently, poorly motivated Ss would recall little 
of the relevant stimulus material from the frames. His results showed that 
recall increased with motivation when underlining was present, but not 
when it was absent. 


Characteristics of Text Material 


Under some conditions, rearranging the information in a text may 
alter the behaviors in which a reader engages, and consequently what he 
remembers. Frase (1969a) did not instruct Ss to learn anything, but only 
to find a specific bit of information in a text. In one study the text material 
described ten planets, each having three attributes (their distance from 
Earth, their terrain, and their atmospheric color). Each sentence contained 
the name and one attribute of a planet. One group of Ss saw these sentences 
arranged in 10 paragraphs, each describing a planet. The other group 
saw the same sentences, but they were arranged in three paragraphs: one 
gave the distances of all the planets, one their terrains, and one their 
atmospheric colors. Both groups were told to find the name of the planet 
that was a certain distance from the Earth, and had a certain terrain and 
atmospheric color. After finding this information, Ss were given unan- 
nounced recall and recognition tests. 


The idea underlying this study was that Ss would use the attributes 
as criteria for testing the sentences they encountered. But if, as in the case 
of the group that read the sentences organized by attributes, the information 
about a particular planet was located in separate portions of the passage, Ss 
would enter the name into memory briefly while they searched for the next 
attribute of that planet. The retention tests, as predicted from an analysis 
of these adaptive behaviors, indicated that the group that had to search for 
related information remembered about twice as many names. 


Frase and Silbiger (1968) used similar materials that described 15 
planets, each with 5 attributes. Four groups in this study searched for 
[...] of the 15 planets. The [...] 
they searched for was embedded in paragraphs containing irrelevant 
sentences about other planets. The results of this study showed that 
[...] that were incidental to the [...] 
names which were not the targets of the search task. Recognition scores 
on a test administered one month after [...] there 
was practically no loss in Ss’ ability to identify the names they [...] 
from a large set of similar names. According to Ss’ verbal reports, they did 
not “intend” to learn anything. The point to be made concerning these 
studies is that retention was a joint consequence of the reader’s goal and 
the distribution of the information related to that goal. 

What behaviors might be important when Ss are simply told to learn 
differently organized passages? To study this question, Frase (1969c) used 
a text which described 8 attributes of 6 chessmen. Each sentence contained 
the name of a chessman and one attribute of that man, so that the sentences 
might be arranged in any order. There were 
48 sentences. For one group, each paragraph described all the attributes of 
one chessman. For another group, each paragraph described only one 
attribute for all of the chessmen. For another group, the sentences were 
randomly ordered. Half of the Ss were told the conceptual structure of the 
passage (the categories of attributes and that there were six chessmen), 
before they were allowed to read the text. The Ss were given three trials 
of five minutes each for reading, and they wrote as much as they could 
remember about the passage after each trial. Thus, the major factors in 
this experiment were whether Ss were informed or uninformed about the 
conceptual structure of the text before reading and whether the passage 
was organized by name or attribute or was randomized. A numerical index 
of clustering was used to characterize the amount of organization of the 
stimulus texts and Ss’ free recall protocols. 


The free recall data showed that, as trials progressed, the informed 
groups showed increasingly superior recall, resulting in an interaction 
between conceptual pre-information and trials. The effects of this internal- 
ized, rather general orienting direction about text structure tended to show 
up in recall only after rather large amounts of information had been 
acquired. 

The randomly ordered sentences retarded learning and the two 
organized passages were learned equally well. Both of the well-organized 
passages resulted in strong sequential learning effects. Subjects tended to 
learn the passages in the order in which the material was presented. This 
was not the case for the randomized text group. Comments by Ss in the 
random group indicated that they had tried to organize the material by 
searching for the related information. This search resulted in depressed 
learning scores for the randomized group but in relatively well organized 
recall protocols for that group. In the previous studies (Frase, 1969a; Frase 
& Silbiger, 1968), an orienting direction required Ss to respond to the 
relationships among sentences. Learning was modified systematically when 
the sentences were disorganized. In the present study, Ss’ spontaneous 
attempts to organize sentences accounted for high recall clustering but low 
over-all recall when the sentences were not well organized. 


Summary 


The data reviewed in this paper suggest that mathemagenic behaviors 
can be usefully viewed as components of an adaptive system in which 
these behaviors are modified by characteristics of two kinds of inputs: 1) 
those that occur prior to encounters with the text and 2) those that are 
characteristic of the text. Control of both of these classes of verbal events 
provides access to important mathemagenic behaviors. A useful factor which 
emerged from the data cited here is the effect of the temporal relationships 
among certain components of this system. 

Post-questions act as confirmation for important mathemagenic 
behaviors, and it has been shown that these questions worked best when 
they occurred immediately after small amounts of text. Their effect may 
be to maintain appropriate responses on succeeding portions of a text 
(Frase, 1968d). It appears likely that the physical proximity of text and a 
related post-question can be especially important for retention. 

A pre-question interacts with the text to permit the selection of relevant 
information and the rejection of incidental information. It has been shown 
that these selective effects were especially strong when pre-questions 
occurred immediately before small amounts of text. However, pre-questions 
need not occur frequently if they have been rehearsed and have achieved 
some stability in memory. Therefore, the physical proximity between text 
and a preceding orienting stimulus may not be critical for the control of 
learning. The differential effect of these temporal relations upon pre- 
and post-questions suggests that a different set of laws may apply to the 
effects of a question, depending upon its position. A factual pre-question 
functions to cue responses to specific stimuli. Placed after the related text, 
the same question can reinforce learning behaviors which are not limited to 
specific stimuli. 

Contiguity is important in maintaining the effects of pre-questions 
across a range of incentive conditions. Post-questions, however, are relatively 
free from variations in motivational level and are especially useful when 
motivation is low. 

Contiguity also applies to the characteristics of text, per se, since the 
distribution of related information in a text can exert important controls 
over the learning process. For instance, when Ss search a text for specific 
information, they select stimuli that maintain the connectedness of the 
information; these incidental encounters with stimuli produce substantial 
learning. If a poorly organized text is to be learned in toto, without any 
orienting direction, then spontaneous selective behaviors may detract from 
the time available for storing information. 

Structural information can act as a general orienting direction (or 
advance organizer; Ausubel, 1963) that controls learning behaviors per- 
taining to categories of information. The advantage of such category 
information would be expected to increase as more information is acquired. 
Research showed that informing the reader in advance about the structure 
of a text increased recall during the latter stages of acquisition. 

The importance of the temporal relationship among these input 
events, attested to by their potential for restricting or expanding the range 
of learning if used appropriately and by Ss’ attempts to establish contiguity 
where it was lacking, reveals something about the control of mathemagenic 
behaviors. These input events, whether they are orienting directions or 
separate sentences of a text, are unstable but interdependent components 
of the system—as cognitive events they decay rapidly unless steps are 
taken to maintain them. With the decay of these components, control over 
the process of learning is relinquished. 


Questions are motivational stimuli. They have arousal and associative 
outcomes. Studies reviewed in this paper emphasize their latter function. 
The data suggest that questions effectively transfer learning activities from 
some text stimuli to others. An important problem is to determine what 
kind of learning activities can be made contingent upon the discriminative 
text cues that are defined by questions. Richards (1956, p. 66) used 
specialized symbols, placed around text words, to cue different analytic 
behaviors during reading. Research and development of such techniques 
can provide insight into mathemagenic behaviors and make it possible to 
program learning from ordinary text. 


CONTROL OF STUDENT MEDIATING PROCESSES 
DURING VERBAL LEARNING AND INSTRUCTION¹ 


The general thesis of this paper is that the activities the student 
engages in when confronted with instructional tasks are of crucial im- 
portance in determining what he will learn. The alternative is to view the 
student as a passive receptacle whose learning and performance are directly 
determined by input variables. In the latter view, long the dominant one in 
psychology as well as in education, the function of the educational researcher 
is to develop what Rothkopf (1965, 1968) calls a “calculus of practice”. 
However, if the student is inevitably an active agent in his own learning, 
it is important to consider an approach in which the emphasis is upon 
discovering ways of managing the student activities which give rise to 
learning. 

One cannot be sure what a student is doing when he is looking at the 
pages of a textbook. He may be reading every line or he may be skimming 
the page. He may test himself on the implication of what he reads, but 
he may not. He may give selective emphasis to certain sections, as students 
seem to do when they underline portions of a text. The student’s emphasis 
is not necessarily the emphasis that the teacher desires. The student may 
spend more time on sections that he has trouble understanding, or he 
may skip difficult sections. If the student gets bored or tired he may 
begin to daydream or even go to sleep. A similar analysis could be made 
of the behavior of students confronted with a lecture, demonstration, class 
discussion, or self-instructional program. 

As Rothkopf (1968, p. 115) noted, “The instructor is in control of 
the nominal stimuli. In this way he can arrange for potential stimulation 
of the student. Whether these potential stimuli become actual ones depends 
critically on some actions by the student.” The traditional name for the 
processes whereby learners translate nominal stimuli into effective stimuli is 
attention. 


¹ A preliminary version of this paper was read at the Annual Meeting of the American 
Educational Research Association, Los Angeles, February, 1969. Preparation of this 
paper was supported in part by the Advanced Research Projects Agency through the 
Office of Naval Research under Contract Nonr 3985(08). 

This paper contains a review of evidence on the role of attentional 
processes in verbal learning and instruction, a description of procedures and 
configurations of material that maintain some control over attention, and a 
discussion of the conditions under which control of attention is important. 
Considered, finally, are student mediating processes which, in addition to 
attention, it may be important to understand and to find ways of 
influencing. 


Attention 


Attention involves at least two processes. The first is orientation of the 
receptors toward the stimulus. This aspect of attention is sometimes observ- 
able. One can, for instance, determine where a student’s eyes are looking. 
A second process in attention is the encoding of the stimulus. The learner 
responds to one or more aspects of the stimulus; this encoding response is 
really the effective stimulus. Examples of encoding responses are subvocal 
articulations of verbal stimuli and images formed with respect to either 
verbal or nonverbal stimuli. The encoding response may involve an elaboration 
on a nominal stimulus. For instance, the trigram SIG may be encoded 
as signal. 


Although subjects’ reports are often informative, the encoding of a 
stimulus is usually identified by inference rather than by direct observation. 
The logic for making such inferences is illustrated in the research of Under- 
wood, Ham, and Ekstrand (1962) on cue selection in paired associate 
learning. The stimuli were CCC trigrams printed in black on distinctive 
colored backgrounds. After learning the list to criterion, Ss received a 
transfer task which involved the same responses and, as stimuli, either 
the CCCs alone or the colors alone. With the color stimuli, performance was 
at criterion level immediately, whereas the group that received CCCs 
showed no transfer. The presumption is that most subjects encoded color 
instead of one or more of the letters in the trigrams. 


Cue selection, or selective attention, has been demonstrated for other 
types of compound stimuli. Cohen and Musgrave (1964) presented stimuli 
consisting of two trigrams representing high-high, high-low, low-high, or 
low-low combinations of meaningfulness. On a transfer task, more correct 
anticipations occurred to single high-meaningful trigrams than to single 
low-meaningful trigrams. When both training stimuli were of low meaning- 
fulness, there were more correct anticipations to the trigram which had 
occupied the left-hand position in the compound than to the one which had 
occupied the right-hand position. Studies such as this suggest that people 
selectively attend to the strongest, most salient, most meaningful, or most 
discriminable aspect of a compound stimulus. 

Several studies employing a Russian vocabulary task that simulated a 
self-instructional program demonstrate that procedural variables can influ- 
ence whether people will pay attention to even an easily encodable stimulus 
(Anderson & Faust, 1967; Faust, 1967; Royer, 1969; see also Faust & Ander- 
son, 1967). Each frame in the program consisted of a paragraph of five 
sentences with high frequency English words as subject nouns and Russian 
words as predicate nouns. Immediately below the paragraphs one of the 
sentences was repeated with a blank in place of the Russian word. The 
critical sentence appeared equally often at random in each of the five 
ordinal positions in the paragraphs. None of the filler sentences in any 
frame was the critical sentence in a different frame. In a second otherwise 
identical version of the program the Russian word which was to serve as the 
response was underlined in every frame. Fig. 1 diagrams the minimal 
inspection of a frame necessary to complete the blank correctly. Each 
frame in the No-Underline version required the student to at least notice 
the cue, since the only way to locate the Russian word that went in the 
blank was to find the English word with which it was paired. The Underline 
version allowed the student to copy Russian words into the blanks without 
looking at the English words. Although both versions of the program led to 
error-free performance during learning, in every experiment subjects who 
completed the No-Underline version recalled more of the Russian words 
written into the blanks in the program than those who received the Underline 
version. 


UNDERLINE VERSION 

A rag is a tryapka. A bridge is a mohst. A table is a stohl. A college 
is a vooz. An onion is a look. 

A table is a ______ 


NO UNDERLINE VERSION 

A rag is a tryapka. A bridge is a mohst. A table is a stohl. A college 
is a vooz. An onion is a look. 

A table is a ______ 


Fig. 1. A frame from a Russian vocabulary program. The lines trace 
[...] the materials required to locate the response term. From 
Anderson and Faust, 1967. 


The Underline version of the Russian vocabulary program did not 
prevent students from spontaneously or voluntarily paying attention to 
the English cue words. Among subjects who completed the program most 
quickly and who, therefore, were presumably least likely to notice the 
cues voluntarily, the No-Underline version led to markedly higher learning 
than the Underline version; there was little difference between the two 
versions for subjects who completed the program slowly. 
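
The construction of the vocabulary frames described above can be made concrete with a short sketch. It is only an illustration under stated assumptions, not the materials used by Anderson and Faust: the function names, the use of underscores to stand for underlining, the rotation of ordinal positions by shuffling a balanced list, and every word pair other than those shown in Fig. 1 are hypothetical.

```python
import random

# Word pairs shown in Fig. 1; "table" is the critical pair in that frame.
FILLERS = [("rag", "tryapka"), ("bridge", "mohst"),
           ("college", "vooz"), ("onion", "look")]
# Only the first critical pair comes from Fig. 1; the rest are
# illustrative additions, not items from the original program.
CRITICAL = [("table", "stohl"), ("house", "dom"), ("book", "kniga"),
            ("window", "okno"), ("city", "gorod")]


def article(noun):
    """Pick 'A' or 'An' for an English noun."""
    return "An" if noun[0].lower() in "aeiou" else "A"


def sentence(english, russian, underline=False):
    """One sentence of a frame; underscores stand in for underlining."""
    word = f"_{russian}_" if underline else russian
    return f"{article(english)} {english} is a {word}."


def build_frames(critical, fillers, underline=False, seed=0):
    """Build one five-sentence frame per critical pair.

    The critical sentence occupies each of the five ordinal positions
    equally often, and filler words never serve as critical words.
    """
    if len(critical) % 5 != 0:
        raise ValueError("number of critical pairs must be a multiple of 5")
    if len(fillers) < 4:
        raise ValueError("need at least four filler pairs")
    if {e for e, _ in critical} & {e for e, _ in fillers}:
        raise ValueError("filler words must not also be critical words")
    rng = random.Random(seed)
    # Balanced list of positions, presented in random order.
    positions = list(range(5)) * (len(critical) // 5)
    rng.shuffle(positions)
    frames = []
    for (english, russian), pos in zip(critical, positions):
        sentences = [sentence(e, r) for e, r in rng.sample(fillers, 4)]
        sentences.insert(pos, sentence(english, russian, underline))
        frames.append({
            "paragraph": " ".join(sentences),
            "prompt": f"{article(english)} {english} is a ______",
            "answer": russian,
            "position": pos,
        })
    return frames
```

Calling `build_frames(CRITICAL, FILLERS, underline=True)` yields the Underline version and `underline=False` the No-Underline version; the two differ only in whether the response word is marked, which is the manipulation at issue in the text.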


Faust (1967) experimentally manipulated motivation in order to 
influence voluntary attention to the cues. The No-Underline version was 
much superior to the Underline version when subjects cancelled letters 
for 30 minutes before starting the program. The advantage of the No- 
Underline version was smallest for subjects who were told that performance 
on the program was a measure of learning ability that might relate to 
academic and professional success. In a concurrent experiment, Faust 
instructed some subjects that they would be judged chiefly in terms of 
how quickly they completed the program. Those who received the No- 
Underline version learned many more words under these conditions than 
subjects who got the Underline version, presumably because voluntary 
attention to the cues was reduced. These results indicate that control of 
attention is likely to be most important when students are bored, tired, or 
under pressure to work quickly. 


Subject responses to post-experiment questionnaires were uniformly 
consistent with the notion that more English-Russian associations were 
learned from the No-Underline than the Underline program because the 
former version required the subject to notice the cue as a condition for 
correct responding. Aside from these data, the evidence advanced thus 
far for the hypothesis of attentional control is largely indirect. Royer (1969) 
observed the eye movements of subjects through a peephole in the stimulus 
materials as they completed one or the other version of the Russian vo- 
cabulary program. To reduce voluntary inspection of the English cues, 
subjects worked arithmetic problems for 30 minutes before they were given 
the program. The group that received the No-Underline version learned 
more Russian words and observed the English cue words more frequently 
than the group that received the Underline version. Neither difference was 
statistically significant, but the trends were consistent with the hypothesis. 
The authors believe the experiment failed to show stronger effects because, 
despite the arithmetic problems, there was a high level of voluntary inspec- 
tion of the cues out of deference to the experimenter who, needless to say, 
watched performance very closely. In all of the other experiments with the 
Russian vocabulary task, subjects were run in groups of five or more. 
Deference, or compliance, probably operates less strongly in a group setting. 


The Attentional Analysis of Instructional Procedures 


The analysis of attention has direct implications for instruction. The 
question to be asked about any instructional procedure is, “Could a person 
giving minimal compliance to the demands of the teacher or teaching agent 
fail to pay attention to critical material?” The argument is that students 
tend to follow a principle of least effort (cf. Underwood, 1963). When it is 
possible to short-circuit the instructional task, students will often fail to 
learn what a lesson is intended to teach them. 

An attentional analysis can shed light on many of the anomalous 
results that have appeared in the literature on programed instruction in 
recent years. Several programed instruction techniques will be treated 
in detail on the following pages. Then the role of attentional factors 
within other instructional methods and educational settings will be briefly 
considered. 


Prompting 


Prompting is the technique of providing hints of various kinds to help 
the student respond correctly (cf. Markle, 1969). Whatever advantages 
prompting may have, an attentional analysis suggests that overprompted 
programs often permit the student to respond correctly on the basis of 
prompts alone without paying attention to the entire cue, even, for example, 
when he is defining a technical term. As a result the cue often fails to 
become a discriminative stimulus for the response and, for example, the 
student is unable to produce a technical term when its definition appears 
on an achievement test. Anderson, Faust, and Roderick (1968) altered the 
first 1,052 frames from the Holland and Skinner (1961) program to include 
additional prompts in most frames. As expected, groups that received the 
unaltered program scored significantly higher on the achievement test than 
did groups that were given the heavily-prompted version. A postexperiment 
questionnaire asked subjects to comment on any short cuts they used to 
complete the program. Twenty-one of the fifty-four subjects who received 
the heavily-prompted program and one who was given the standard version 
reported using prompts to take short cuts. The comments were interesting; 
for instance, one student wrote, “I noticed that the underlined words were 
usually the answers so [I] often copied them before reading.” Another 
student commented, “When the first letters of the missing answer were 
given, it was difficult to not answer [before] studying the question ..., as 
the answers were obvious.” 

Duell (1967) investigated prompting techniques for teaching children 
a sight vocabulary. Two of the procedures she compared are diagramed 
in Fig. 2. First the children briefly saw a word paired with a picture. 
When the second frame appeared the child was instructed to “point to the 
one you just saw and read it.” Stolurow and Lippert (1964) reported that 
a procedure resembling the one pictured on the left in Fig. 2 was 
successful in teaching mental retardates a sight vocabulary. However, 
Duell reasoned that this procedure does not require attention to the cue 
words; the child can always respond correctly on the basis of the picture 
alone. A child given the procedure shown on the right must pay attention 
to the word cue in order to consistently respond correctly. Normal 
kindergarteners received sixteen trials on a list of eight words using either 
prompted or unprompted discrimination frames. Children who received 
the unprompted procedure learned almost four times as many words as 
children given the prompted procedure. Under the assumption that 
children look where they point, Duell divided the children who got 
prompted discrimination frames into two groups on the basis of the 
frequency with which they pointed at the word instead of the picture 
during training. Children who most frequently pointed at the word learned 
four times as many words as those who usually pointed at the picture 
(but still only somewhat more than half as many words as those who 
received the other procedure). 


PROMPTED DISCRIMINATION        UNPROMPTED DISCRIMINATION 

WORD A   PICTURE A             WORD A   PICTURE A 

WORD A   PICTURE A             WORD A 
WORD B   PICTURE B             WORD B 


Fig. 2. Sequence of events in two procedures for teaching children a sight vocabulary. 
Based on Duell, 1967. 

In contrast to the experiments with instructional materials, several 
paired associate studies showed that people learn faster under a prompting 
procedure, in which both the stimulus and response term appear together 
before the response is required, than under a “confirmation” or anticipation 
method (Cook & Spitzer, 1960; Sidowski, Kopstein, & Shillestad, 1961; 
Levine, 1965). However, Cunningham and Anderson (1968) noted that 
the prompting method entailed more time than the confirmation method 
in which the stimulus and response term were available for rehear- 
sal or mental elaboration. They found no difference between 
the prompting method previously used and a confirmation procedure 
modified to equate effective study time. 


Active Responding 


It has been an article of faith among programers that students learn 
more when they are required to make frequent, overt responses. Therefore, 
[...] 
students who were required to make overt responses learned no more than 
students who were asked to “think” the answers that went into the blanks 
or to read programs with the blanks filled in (Alter & Silverman, 1962; 
Crist, 1966; Della Piana, 1962; Hartman, Morrison, & Carlson, 1963; 
Lambert, Miller, & Wiley, 1962; Stolurow & Walker, 1962; Tobias & Weiner, 
1963). On the basis of an attentional analysis, one could predict that when 
a student can respond correctly to the frames in a program without paying 
attention to critical material, attention to this material is minimized and 
learning suffers. However, the requirement to make overt responses might 
be facilitative provided that correct responding was contingent upon atten- 
tion to all of the critical material. 

Kemp and Holland (1966) developed a measure of the extent to which 
responses are contingent upon [...] 
of correct responses. [...] 
blackout ratio. Obviously, a student could [...] 
much of the material in these frames. Kemp and Holland [...] 
blackout ratios for the programs used in 12 experiments that [...] 
compared overt responding with covert responding [...]. The lowest 
four blackout ratios, ranging from [...] to [...], [...] 
a significant advantage for overt responding [...]. 
The remaining eight, ranging from 31% to 75%, were 
obtained for programs that were employed in studies that showed no 
difference between overt responding and covert responding or reading. 
The research with the blackout technique indicates that cue-contingent, 
overt responses facilitate learning. It may well be true, however, that the 
requirement to make overt responses which are not cue-contingent disrupts 
normally adaptive reading habits. 


22. A B--- of Exchange (Draft) is convenient for 
the payment of debts. 

23. The seller of merchandise by sending a Bill of 
E--- drawn on the buyer and attaching the 
shipping documents to a bank for collection 
can be assured that the merchandise will not 
be delivered to the buyer until the buyer pays 
for it. 


Fig. 3. Two frames from a monetary program. Adapted from Holland and Kemp, 1965. 
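
The arithmetic behind a blackout ratio can be sketched briefly. The sketch assumes that the ratio is the percentage of a program's words that can be blacked out without lowering the rate of correct responses; that reading, the function name, and the per-word judgments in the example are assumptions made for illustration, not Kemp and Holland's procedure or data.

```python
def blackout_ratio(frames):
    """Percentage of words in a program that could be blacked out.

    `frames` is a list of frames; each frame is a list of
    (word, is_needed) pairs, where is_needed is True when the word
    must stay visible for the student to respond correctly.
    """
    total = 0
    removable = 0
    for frame in frames:
        for _word, is_needed in frame:
            total += 1
            if not is_needed:
                removable += 1
    if total == 0:
        raise ValueError("program contains no words")
    return 100.0 * removable / total


# Hypothetical judgments for a short frame: suppose only the words
# around the blank are needed to complete "B---".
example_frame = [
    ("A", False), ("B---", True), ("of", True), ("Exchange", True),
    ("(Draft)", False), ("is", False), ("convenient", False),
    ("for", False), ("the", False), ("payment", False),
    ("of", False), ("debts.", False),
]

ratio = blackout_ratio([example_frame])  # 9 of 12 words are removable
```

On this reading, a high ratio means that little of the material has to be noticed for a correct response, which is the condition under which the studies cited above found no advantage for overt responding.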


Immediate Feedback 


To the behaviorist, the most shocking result of programed instruction 
research is the finding that programs teach as much or more when 
immediate “reinforcement” is omitted (Krumboltz & Weisman, 1962; Hough 
& Revsin, 1963; Rosenstock, Moore, & Smith, 1965; Sullivan, Baker, & 
Schutz, 1967; O'Day, Kulhavy, Anderson, & Malczynski, in press). Lublin 
(1965) found that a group that completed the first 1,052 frames of the 
Holland and Skinner (1961) program without feedback actually did sig- 
nificantly better on the criterion test than the group did that saw the correct 
answer after every frame. These findings are very difficult to understand 
if one assumes that seeing the correct answer is reinforcing. But even if it 
is not reinforcing, displaying the correct answer can obviously furnish 
corrective feedback when the student makes a mistake. Therefore, no matter 
what the theory is, it would seem that providing immediate knowledge of 
results should facilitate learning. 


A plausible interpretation of the failure to find facilitation is that 
students pay less attention when immediate feedback is provided. Indeed, 
this has been the hypothesis of several of the previous investigators. Some- 
times there may be a gross short-circuiting of attention when the correct 
answer is readily available: the student may copy it into the blank without 
reading the material in the frame. Several of the feedback studies, including 
Lublin’s, were completed with the frames printed in consecutive order down 
sheets of paper. The correct answers appeared immediately below the 
frames. This arrangement makes accidental exposure of the answer difficult 
to avoid, and of course makes deliberate cheating easy. 

Anderson, Kulhavy, and Andre (1970) completed two experiments 
using a program on the diagnosis of myocardial infarction from ECG 
tracings. The program was presented on PLATO, a computer-based teaching 
machine system which made it possible to insure that students responded 
before they saw the correct answer. In each study, the group that always 
received feedback did significantly better on the criterion test than the 
group that never received feedback. Especially interesting was the perform- 
ance of a group that had the correct answer continuously in view in the 
lower right hand corner of every frame. The directions for this group 
stated three times that the student should respond before he looked at 
the correct answer. Nonetheless, this group scored significantly below the 
no feedback group on the criterion test. Eleven of the twenty-two subjects 
in the group that had the correct answers continuously in view volunteered 
after the experiment that they could not avoid looking at the answers 
before they responded. 


Anderson (1969) evaluated the effectiveness of a program on popula- 
tion genetics in fifteen high school biology classrooms. To reduce accidental 
exposure of the feedback, each frame occupied a separate page; the feedback 
for that frame appeared at the top of the next page. For the same reason, 
the program was mimeographed on blue paper through which it was im- 
possible to read the feedback (see Brown, 1966; Anderson, Faust, & 
Roderick, 1968). When they completed the program, students were asked 
how frequently they had turned the page and copied the correct answer 
when they found a question difficult. Despite the fact that the directions 
to the program exhorted students to “write an answer to each question and 
fill in each blank before you look ahead to the correct answer,” better than 
40% said they sometimes copied answers to difficult questions and 20% 
said they usually did. Fig. 4 plots achievement as a function of the fre- 
quency with which students reported peeking at the right answer. 


Instructional sequence. It is possible to give an attentional interpreta- 
tion to the studies showing that some programs teach as well when the 
frames are presented in a scrambled order as they do when the frames are 
presented in the “careful, logical sequence” intended by the programer 
(Roe, Case, & Roe, 1962; Levin & Baker, 1963; Payne, Krathwohl, & 
Gordon, 1967). The adjacent frames in many programs involve very 
similar responses to very similar stimuli. It could well be that a student 
pays less attention to consecutive encounters with essentially the same 
material than he does when he encounters the material again after an 
interval filled with different material. Spaced encounters are more likely
when the frames in a redundant program are presented in a scrambled order
(see Rothkopf & Coke, 1966; Greeno, 1964).


Fig. 4. Achievement as a function of the reported frequency of copying correct responses
to difficult frames. From Anderson, 1969. (Vertical axis: percent correct on the
achievement test. Horizontal axis: frequency of copying, from usually through
sometimes and hardly ever to never.)


Questioning During Reading 


Attention is probably an important factor in learning from written 
materials. Hoffman (1946) completed a suggestive study in which college 
undergraduates read a history textbook for four consecutive hours. Eye
movements were monitored by recording changes in electrical potential
through electrodes attached near the reader’s eyes. The “electrooculograms”
showed that the pattern of eye movements associated with reading steadily
deteriorated over the four-hour period. For instance, half-hour by half-hour
there was a regular increase in the frequency of eye blinks and the
variability of eye movements, and a regular decrease in the number of lines
read per minute. Hoffman's graphs resemble extinction curves. 


In a related study, Carmichael and Dearborn (1947) monitored the 
eye movements of high school and college students as they read for
continuous six hour periods. Unlike Hoffman, they gave a quiz covering 
the preceding material after every 25 pages. In their concluding remarks, 
Carmichael and Dearborn (p. 370, 1947) stated: 


“. . . by the introduction of tests . . . and by certain other
devices calculated to render the performance more constant, the
motivation and general attitude of the subject were controlled in
a way that they had not been in the preliminary t
[the Hoffman experiment]. The result of this apparently small
change in experimental conditions was to demonstrate the virtual
absence of work decrement or fatigue on the part of all sub-
jects . . . , so far as the objective records of the eyes behavior are
concerned.”


The Carmichael and Dearborn study foreshadowed the recent work of 
Rothkopf and others. Rothkopf (1966) divided a 5,200-word chapter
from Rachel Carson’s book The Sea Around Us (1961) into seven sections,
each three multilithed pages in length. Before or after reading each
section of the chapter, groups of paid adult volunteers answered
questions based on that section. Upon completing the entire chapter,
subjects were given a criterion test that consisted of the questions [...]
while reading and additional questions that were not practiced. [...]
practiced questions, subjects who answered these same [...]
the passage did about 40% better than the reading-only [...].
[...] questions after, but not [...] a sect[...]
also had a significant effect on nonpracticed [...]
attentional hypothesis, asking questions before a person [...]
should cause him to [...] for the answers to these questions [...]
not be expected to facilitate performance [...]
tained in [...] achievement test. An outside control group drilled on the
practiced questions did not improve [...];
therefore, the facilitation associated with [...]
could not be attributed to transfer of [...]
increased attention.

Rothkopf and Bisbicos (1967) [...] that if the increase [...]
[...] from inserting questions within
a lesson is caused by getting students [...]
the questions on a certain kind [...]
this kind of content. For each three page section of The Sea Around Us
some groups of high school students were asked two questions requiring 
either a measured quantity (a distance, a date) or a proper name as an 
answer. Other groups were asked only questions that could be answered 
with a common English word or a technical term (Bathyscaphe, photo- 
tropic). Once again, asking questions after the student had read the 
relevant section improved posttest performance on nonpracticed items, 
whereas asking questions before the section upon which they were based 
did not. Presenting a restricted category of questions within the passage 
facilitated performance on nonpracticed achievement test items within that 
category. As predicted, this effect was somewhat greater for achievement 
test questions based on the second half of the passage than on the first 
half, presumably because time was required for the questions within the 
passage to shape the reader’s attention. 


Subsequent research (Frase, 1967, 1968a, 1968b; Bruning, 1968) 
confirmed that inserting questions in reading material, either before or 
after the section of the passage on which they are based, sharply improves 
performance relative to a no-question group when the same questions are 
repeated again on the criterion test. These studies also confirm that per- 
formance on nonpracticed criterion test questions improves only when 
questions are inserted in reading material after the relevant passages. 


Classroom Instruction 


Attention is probably an important determinant of the effectiveness 
of classroom instruction. Evidence for this proposition can be found in 
a study by Morsh (1956) of the classroom behavior of the students and 
instructors in 120 sections of an aircraft mechanics course. The dependent 
variable was residual student achievement gain, defined as actual
performance on the achievement posttest minus the posttest
score predicted from pretest score. Ten types of student behavior were
observed using a systematic observation scheme. The frequency of five 
types of student behavior correlated significantly with achievement gain: 
looks around (—.21); ignores instructor (—.27); slumps (—.31); yawns or 
stretches (—.41); and sleeps or dozes (—.28). Obviously, each of these 
is indicative of inattention. 
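

The residual gain measure described above can be illustrated with a brief sketch. It assumes that the predicted posttest score comes from a simple least-squares line fitted to pretest scores, which the account of the Morsh study does not specify, and every score and behavior count in it is hypothetical. The function names are illustrative only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_gains(pretest, posttest):
    """Actual posttest score minus the posttest score predicted from pretest."""
    slope, intercept = fit_line(pretest, posttest)
    return [y - (slope * x + intercept) for x, y in zip(pretest, posttest)]


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores for four class sections (not data from the study).
pretest = [10, 20, 30, 40]
posttest = [15, 22, 38, 41]
yawns = [1, 4, 0, 5]  # hypothetical frequency of one observed behavior

gains = residual_gains(pretest, posttest)
r = pearson_r(yawns, gains)
print(gains, r)
```

A negative value of r in such an analysis corresponds to the pattern reported above: the more frequent the inattentive behavior, the lower the achievement relative to what the pretest predicted.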


Based on the research with written instruction, intermittent questioning 
is a likely candidate to improve attention to lectures and other classroom 
activities. Van Wagenen and Travers (1963) and Travers, et al. (1964) 
taught German vocabulary to face-to-face groups of eight fourth, fifth, or 
sixth graders for three consecutive days. The teacher held up large cards 
upon which German words and two English alternatives appeared. For each 
card, one child was called on to guess the correct English associate. Only
four of the children in each group participated in actively responding. The
other four were observers. On the posttest, the participants performed better 
than the observers on the nonpracticed items as well as the ones they had 
actively practiced. The difference between the participants and observers 
on both practiced and nonpracticed items grew larger from the first to the 
third day. 


Other classroom research has failed to show effects attributable to 
increased attention when questions are inserted within a lesson. Michael 
and Maccoby (1961) showed 49 classes of high school juniors and seniors 
a 14-minute film on civilian defense against atomic bombing. Some classes 
only saw the film. For other classes, the film was stopped three times. 
During the breaks the teacher read questions covering some of the points 
presented in the preceding section of the film. On posttest questions which 
had been asked and answered during the film, classes that had received the 
questions did substantially better than classes that only viewed the film. 
However, there was no difference between the groups on additional posttest 
questions that had not been asked during the film. 


The Michael and Maccoby experiment was completed in 1950 and 
1951 shortly after the Soviet Union exploded its first atomic device, so it 
could be argued there was no facilitation of nonpracticed items attributable 
to attention because the film was intrinsically interesting and the students 
were already paying close attention. Based on this reasoning, Maccoby, 
Michael, and Levine (1961) presented what they presumed to be a dull 
film that traced the history of world map concepts from ancient times to the 
Air Age to a thousand air force trainees. Half of the airmen engaged in 
active review sessions, two following brief sections of the film and one at 
the end. In addition to the usual large effect on repeated questions, the 
group that participated in active review showed a slight
advantage over the film-only group on new questions that had not been
asked during the active review sessions.


With imagination and persistence, it should be possible to modify
the techniques used by teachers and the developers of instructional
materials so as to better maintain and regulate attention. Consider
classroom discussion techniques for illustration. An hypothesis that follows 
from an analysis of attention is that a class of students will benefit most 
from recitation when the teacher asks a question, pauses, and then calls on 
a student. Whether or not he is actually called on, this arrangement should 
maintain the student’s attention to questions and make it probable that he 
will formulate answers since he may be required to answer overtly. When 
the teacher calls on a student before he asks a question, students not 
designated are relieved of the responsibility to answer in public. Therefore, 
they may not pay attention to the question nor formulate answers. The
Morsh (1956) study previously reviewed contains evidence in support of
this hypothesis. Only one of the twenty-five kinds of teacher behavior counted—
“Ask question, Designate student” (to be distinguished from “Designate
student, Ask question”)—was significantly correlated (.22) with student
achievement gains.

Cue Encoding and Associative Linkage


Attention, in the narrow sense of noticing a stimulus, is a necessary
but not a sufficient condition for learning from verbal materials. I speculate
that there are at least three mediating processes required to give rise to 
verbal associative learning. The first is noticing the stimulus. The second is 
encoding the stimulus in a “meaningful” manner. The third is conceiving 
linkages between the aspects of the stimulus, including, especially, the 
aspects which will later serve as the cue and the response. Observation 
of the stimulus has already been discussed at length. 


Consider meaningful encoding of the stimulus. Probably the first 
thing a person does with a word he notices is to say it to himself. This can 
be called auditory encoding. Next, the person may give a semantic encoding 
to the word. It is impossible at our present stage of knowledge to be precise 
about what semantic encoding involves, but it undoubtedly includes an 
internal sensory representation, that is, an “image,” of the thing or event
named by the word (cf. Bower, in press). Paivio (1969) reviewed a series 
of experiments which indicate that the image-evoking value of words is 
the most important determiner of the learnability of these words, more 
important than meaningfulness, Thorndike-Lorge frequency, or semantic 
differential ratings, for instance. 


The chief point for the present argument is that auditory encoding 
precedes semantic encoding and that the two processes are distinguishable. 
Evidence for two encoding stages comes from research on memory. Most 
errors in short-term memory arise from confusions between sounds even
when materials are presented visually (Conrad, 1962; Conrad & Hull, 1964;
Wickelgren, 1965, 1966) whereas errors in short-term memory due to
confusions in meaning are relatively rare (Baddeley, 1964). In long-term
memory semantic confusion is a more important source of interference
than acoustical confusion (Baddeley, 1966; Baddeley & Dale, 1966). These
data indicate that the first response to a verbal stimulus is a subvocal
articulation. However, if the stimulus is to be remembered for more than a
few seconds, the articulatory response is followed by some form of meaning-
ful representation.


A final mediating process known to be important in associative learning
is the development of a linkage between the cue and the response.
Montague and his coworkers (Kiess & Montague, 1965; Montague, Adams,
& Kiess, 1966; Montague & Wearing, 1967) showed that the learning of
a pair of verbal units is almost always accompanied by the development 
of a meaningful linkage which they call a “natural language mediator.” 
Rohwer (1966) and others (cf. Paivio, 1969) found that children learn to 


imagery” group recalled 71%, whereas the “separated imagery” group 
recalled only 46%, a highly significant difference. Taken together, these 
studies demonstrate that conceiving a linkage between the cue and response 
is an important process in associative learning. 


Implications for Instruction of the Research 
on Mediating Process 


as necessary if an instructional communication is to give rise to learning. 
These processes are noticing the stimulus, translating it into internal speech, 
evoking images for the things and events named by the words, and con- 
ceiving relationships among the imagined things or events. This is 
admittedly a speculative account, but whatever its shortcomings it has the
virtue of giving a fresh perspective on the task of the instructional tech-
nologist. In this view, the chief problem for educational engineering is to
discover how to alter the characteristics of instructional tasks so as to force
students to do all of the processing required for learning.


Once again, the underlying assumption is that people tend to follow
a law of least effort. That is, it is assumed that people can be counted on
to engage in only the processing demanded by the task. To illustrate, in one 
of the Russian vocabulary studies reviewed earlier (Faust & Anderson, 
1967) Russian words were used which were either easy to pronounce or
hard to pronounce. The procedure which forced attention to the cues was
facilitati[...]-to-pronounce words. Further analysis suggested
that just noticing a pair was sufficient to produce learning only when a
[...] to find. One pair which the attention-forcing
procedure facilitated strongly was as follows: “A beer is a peevoh.” Anyone
who upon occasion drinks a lot of beer will have no trouble linking this 
pair. For a pair which was hard to associate, merely noticing the cue would 
not evoke a mediator. This would require meaningful processing which, it 
is argued, was not forced by the procedure. 


Under some conditions procedures which force attention may actually 
decrease the likelihood that learners will engage in deeper processing. 
When people must deal with a large quantity of verbal material, process- 
ing deteriorates. Attention can be improved by asking frequent questions. 
However, frequent questioning will not inevitably improve learning because 
now only short-term memory is required. Depending upon the questions, a 
person may be able to answer them by simply rehearsing the words (saying 
them to himself) in the preceding passage. 


It is a common observation that a person can read aloud or recite 
the Pledge of Allegiance or the Boy Scout Law without bringing to mind 
the meaning of the words he is speaking. There is rather dramatic evidence 
that the mere pronunciation of words is ineffective in producing learning. 
Bower (in press) reported several experiments in which subjects instructed 
to image a relationship between the objects named in pairs of words recalled
an average of over twice as many words as subjects who said each pair
aloud several times. 


The trick is to arrange a task that requires full processing from the
learner. It has been shown several times that adults learn pairs of
words more quickly if they are required to generate a sentence linking the
pair rather than to read a sentence which the experimenter prepared (Jensen
[...]; Bower, in press). One interpretation of this finding is that
subject-generated sentences are better because each subject selects mediators
which have the highest associative strength for him. Anderson, Crooks,
Kulhavy, and Lieberman (1969) reasoned that experimenter-provided
mediators might prove superior to subject-generated ones if (1) the pairs were
difficult to associate and (2) the cue and mediator strongly evoked the
response. The response terms were English words and, to make the pairs
[...], the cues were CVC trigrams. Difficulty was further differentiated
by having pre-experimental subjects rate the associability of the pairs. The
CVC-word pairs were embedded in “thematically prompted” sentences, an
example of which follows:


Before turning red, traffic SIG are yellow.


To estimate the extent to which the prompt and cue determined the
response, pre-experimental subjects were given each sentence with a blank
in place of the response term, to be filled with a word which followed
sensibly from the rest of the sentence. An experiment was conducted in 
which subjects completed ten study-test trials with prompted sentences, 
unprompted sentences (e.g., SIG are yellow), or the CVC-word pairs.
The CVC’s alone were presented on test trials. The results were negative.
The prompted condition was no better than the other two conditions even
on the nine item sublist composed of hard-to-associate pairs in which the
response was strongly determined. Post-experiment interviews suggested
that subjects were reading the sentences in a cursory fashion. James Royer, 
Raymond Kulhavy and the author modified the prompting procedure in 
the attempt to induce deeper processing. The prompted sentences were 
presented for four seconds with a blank in place of the response term and 
then for two more seconds in which the response term was available. In 
two unpublished experiments, subjects given the modified prompting pro- 
cedure learned the pairs more rapidly than control subjects. The explanation 
is that under the modified prompting procedures, subjects had to call up a 
meaningful representation of the rest of the sentence in order to generate 
the response term. Obviously, the mere sound of the words could not evoke 
the response. 


Bobrow and Bower (1969) completed several experiments which 
demonstrate that procedures which cause subjects to comprehend the 
meaning of sentences strongly facilitate learning. In one experiment, 
subjects were instructed to compose a sentence which was a sensible
continuation of each sentence they saw. For example, if a subject saw the sentence
“The farmer discovered a diamond,” a sensible continuing sentence might
be “He sold it to a jeweler and used the money to buy a tractor.” Despite
the fact that they were vulnerable to interference from the sentences they 
had created themselves, the subjects in the continuation condition recalled 
twice as many predicate nouns, given the subject nouns as retrieval cues, 
than control subjects required to read each sentence aloud three times. 


Frase (1969) completed some fascinating research which shows 
that the nature of the processing that a person must give to a task 
influences what he will learn. The materials consisted of 30 sentences,
each of which described one of the three attributes of ten imaginary planets.
The subjects were given the description of the three attributes of one of the
planets and asked to search through the materials and find the name of
this target planet as quickly as possible. For half of the subjects the
sentences were organized by planet; that is, the sentences describing the
attributes of each planet appeared together. The sentences were grouped
by attribute for the remainder of the subjects. When a subject had
completed his search for the name of the target planet, surprise recall and
recognition tests were given. As Frase had predicted, attribute organization
resulted in superior learning of planet names whereas organization in terms
of the planets led to greater learning of attributes. The explanation was
that when sentences were organized by attribute the subject had to hold 
the name of each candidate planet in memory while checking to see if it 
had the characteristics of the target planet. Organization by planet, it was 
argued, disposed the subject to process the sets of attributes; when a match 
was found, the subject could simply read off the planet’s name. 


In summary, several lines of evidence indicate that learning is 
facilitated when the task requires meaningful processing. While few of the 
specific techniques for promoting deep processing that have been investigat- 
ed so far are likely to have direct pedagogic application, the general impli- 
cation for education is clear, almost tautological: students will learn most 
when they are required to understand.


If there are some subjects on which the results obtained have
finally received the unanimous assent of all who have attended 
to the proof, and others which . . . have never succeeded in estab- 
lishing any considerable body of truths, so as to be beyond denial 
or doubt; it is by generalizing the methods successfully followed in 
the former enquiries, and adapting them to the latter, that we may 
hope to remove this blot on the face of science. 
John Stuart Mill 


Some of the major disasters of mankind have been produced by the 
narrowness of men with a good methodology. . . . To set limits to 
speculation is treason to the future. 


Alfred North Whitehead 


Educational research has been tried and apparently it has failed. Or 
has it? 


If the object of such research is the development of coherent and 
workable theories, researchers are nearly as far from that goal today as 
they are from controlling the weather. If the goal of educational research 


I know of little evidence that researchers have made discernible strides in 
that direction. Which way then do they turn? To more of the same? Or to 
a pragmatic attack on highly specific educational problems, eschewing 
theory development as a goal? Or do they reexamine the basic paradigms 
and parameters of both education and research in order to seek new 
directions? In this paper, I argue that neither a slavish continuation of 
current practices nor a monolithic rejection of them is likely to solve the 
problems of educational research. Researchers must step back, regain
perspective and then identify clearly the most fruitful routes toward de-
velopment of an empirically based discipline of education.

I begin this paper with a brief examination of the nature of education 
and the history of its relations with the behavioral sciences, especially
psychology. Then I discuss certain characteristics of the instructional 
process and problems involved in its study. I conclude this paper with a 
review of potentially useful strategies of educational research and the 
institutional prerequisites for their successful implementation. 


Retrospect: Education and Psychology 


Moving up the phylogenetic scale, one finds that the period of child- 
hood characteristic of the different species increases in length. Moving from 
the less complex to the more complex organisms, the relative proportions 
of instinct to learning as forces influencing behavior rapidly change, until, 
with man, the role of learning is so central that the concept of instinct 
becomes nearly irrelevant. Even in the highly nativistic theory of language 
acquisition proposed by Chomsky (1965) and his followers (see Lenneberg, 
1967), learning plays a central role in the development of specific linguistic
performance. Thus, the major discontinuity between the human species 
and all others lies in what McNeil (1963) has called “systematic develop- 
mental retardation.” 


Indeed, the helplessness of human young must at first have been 
an extraordinary hazard to survival. But this handicap had 
compensations, which in the long run, redounded in truly extra- 
ordinary fashion to the advantage of mankind. For it opened wide 
the gates to the possibility of cultural as against merely biological 
evolution. . . . Biologically considered, the interesting mark of 
humanity was systematic developmental retardation, making the 
human child infantile in comparison to the normal protohuman. 
But developmental retardation, of course, meant prolonged 
plasticity, so that learning could be lengthened. Thereby, the 
range of cultural as against mere biological evolution widened 
enormously; and humanity launched itself upon a biologically as 
well as historically extraordinary career. .. . By permitting, indeed 
compelling, men to instruct their children in the arts of life, the 
prolonged period of infancy and childhood made it possible for
human communities to eventually raise themselves above the
animal level from which they began.


(McNeil, 1963; pp. 20, 21) 


The absence of instinct, the absence of prefabricated behavior
patterns which program the organism at a very early stage in his develop- 


SHULMAN RECONSTRUCTION OF EDUCATIONAL RESEARCH


ment, provides man with his most human characteristic—his educability. 
This malleability acts as a two-edged sword, for with plasticity comes not 
only the potential for limitless growth, but also the danger of inestimable 
damage. 


disciplines which purport to study man and his nature should concentrate 
rather heavily upon studies of his schooling. American psychology at the 
beginning of this century did precisely that. The great men of that period 
of American psychology—William James, E. L. Thorndike, G. Stanley Hall, 
Robert Woodworth, John Dewey and others—were vitally interested in 
studies of the educational process. Such investigations lay at the heart of 
American psychological thinking during that time. However, influenced 
by Lloyd Morgan’s (1894) Canon which effectively placed unobservable 
mental processes out of bounds for psychological study, the discipline of 
scientific psychology in America was slowly transformed from James's 
(1890) “Science of Mental Life” to Watson's (1913) “Science of Behavior.”

In that generally fruitful antimentalistic revolution was discarded not only


baby of experimental educational research. Despite Thorndike’s continuing 
admonitions that the proper laboratory for the psychologist was the class- 
room and its proper subject was the pupil, the study of infrahuman species 
and their behavior came to dominate psychology. In many ways the
efforts of the rat or primate-oriented psychologists contributed significantly
to man’s understanding of the natural world. It would be an overstatement
to assert that the study of school learning disappeared completely from
psychology. It would be even more misleading [...]
were now peripheral to the developing tradition of American [...]
psychology.

Ironically, Soviet psychology, though greatly influenced by the Pav-
lovian [...], never abandoned studies of school learning as a
component of psychological research (Menchinskaya, 1969). Because of the
importance of education in the Marxist-Leninist tra-
dition, Soviet psychological studies focused frequently upon instruction as a
variable of interest. Kilpatrick and Wirszup (1969) report that in a recent
year 37.5% of all materials published in Soviet psychology was devoted
to educational and child psychology.

The only area of educational psychological research in America which 
remained unscathed by this post-Watsonian revolution was that of the 
then still infant investigations of mental measurement. This tradition, 
growing out of the work of Galton, Spearman, Binet and Simon, J. M. 
Cattell, Terman and others, continued to flourish and received its greatest
impetus from the success of mental testing during World War I. The
emphasis of this movement was quantitative and descriptive. The objectives 
were the careful measurement and prediction of individual differences in 
human abilities. The schism between the respective Weltanschauungen of 
experimental psychology and mental measurement grew progressively wider, 
and it was not totally inappropriate that for many years educational 
psychology was identified with educational measurement.* It is only in
the most recent period that these trends began to reverse. The two traditions 
with their respective and almost nonoverlapping methodologies are begin- 
ning to coalesce. 

A good sign that a new field is becoming popular is the creation 
and proliferation of shorthand ways of exchanging fundamental concepts. 
ATI is already recognized by many educational researchers as the acronym 
for “aptitude-treatment interaction” (Cronbach and Snow, 1969), a re- 
search strategy characterized by the marriage of experimental and 
differential approaches. One of the major problems attendant on this 
marriage is the prosperity gap between the two principals. Whereas 
differential psychologists possess a wealth of methods for characterizing 
individual differences, the classification of environments, settings or treat- 
ments remains relatively primitive (Mitchell, 1969). “Aptitude-treatment 
interaction” will likely remain an empty phrase as long as aptitudes are 
measured by micrometer and environments are measured by divining rod. 

In the next section I examine several aspects of environment as a 
factor in educational research: the problems of investigating the effects of 
environments, the kinds of variables that can be used to characterize 
environments and the relevance of environmental analysis to the method- 
ology of educational experimentation. 


The Study of Environments 


Social scientists are dramatically impotent in their ability to character- 
ize environments. Generally, they do not even try. It should by now be a 
truism to point out that neither individuals nor groups can be adequately 
described without reference to some setting. Thus, for Dewey, the starting
point of his discussions was always “some organism in some environment” 
(Dewey, 1938). Murray (1938) posited two equally important categories 
for his studies of personality: needs and press, i.e., person variables and
environment variables. The language of education and the behavioral 
sciences is in great need of a set of terms for describing environments that 
is as articulated, specific and functional as those already possessed for 
characterizing individuals. 


*For fuller discussion of the meaning and implication of this distinction, see Cronbach 
(1957, 1967), see also Cattell (1966b).

An example that is familiar to all educators is the continued use of
such gross terms as “deprived” or “disadvantaged” to characterize the 
environments of many minority-group children. Labeling the setting as 
“disadvantaged,” of course, communicates little that is meaningful about 
the characteristics of that environment. Educators seem unable to progress 
beyond such a simple dichotomy as “advantaged-disadvantaged.” Reviewers 
and critics of research have long realized that even those few categories 
which attempt to describe environments, such as social class, have been 
remarkably ineffectual in pinpointing the educationally relevant differences 
in the backgrounds of individuals (Bloom, 1964; Karp and Sigel, 1965). 

Imagine if the nutritionist, in his attempts to characterize the nutrition- 
al status of the diets of individuals, were to be limited to a distinction 
between “well-nourished” and “malnourished” individuals. One would be 
quite skeptical of the value of generalizations such as “malnourished 
individuals have a higher incidence of respiratory ailments than 
well-nourished,” or “well-nourished subjects were observed to run significantly faster 
than malnourished subjects.” Are educators’ pronouncements about all the 
differences between culturally-disadvantaged and culturally-advantaged 
children any more fruitful? And are the myriad studies contrasting 
lower-class and middle-class youngsters of any greater value? Such descriptive 
studies do not begin to suggest the necessary ingredients of experimental 
programs to change the conditions. Should one simply elevate all lower-class 
people to the middle-class? What could that possibly mean? 

The nutritionist can describe the nutritional environment of individuals 
in terms of caloric content, relative proportions of carbohydrates, fats and 
protein, the presence or absence of quantities of vitamins and minerals, etc. 
(Eichenwald and Fry, 1969). Possessing such precise terms allows him to 
plan systematic tactics of modifying the nutritional status of individuals in 
terms of highly complex, yet manageable patterns. Attaining such a level 
of facility in characterizing the educationally-relevant facets of environments 
should be one of the major goals of educational research. Without such an 
understanding, researchers are clearly handicapped in any attempt to make 
intelligent comparisons among proposed educational programs (such as 
Headstart models), for these programs are themselves planned environments. 

A number of behavioral scientists have begun to study the character- 
istics of environments in a systematic fashion. Bloom’s work (1964) is of 
special interest. Bloom reported many instances of great improvement in 
the effectiveness of academic prediction when measures of the intervening 
environments were taken into account in the prediction equations. He 
emphasized that researchers must replace the older, static terms for describing 
environments (e.g., social class) with dynamic, process variables (e.g., 
achievement press). As evidence for this assertion, he cited the research of 
Dave (1963) and Wolf (1964). Such process variables could well prove to 
have causal influences on the characteristics of interest. This would have 
to be demonstrated through techniques of cross-validation. The goal of all 
such predictive research should be, Bloom maintained, not the inexorable 
stamping of fates on helpless children, but the identification of the critical 
processes contributing to those fates. An understanding of the process 
variables most responsible for the ultimate status of individuals in some 
growth area can provide valuable guidance as researchers attempt to 
develop effective methods for modifying those processes and, hence, for 
destroying the accuracy of their predictions. 


The work of Barker and his colleagues (1968) reflects a totally 
different set of strategies—those of ecological psychology—for studying the 
environment. Pace and Stern (1958) applied the tools of psychological 
measurement to the task of characterizing the essential differences among 
college environments. Henry (1963) used the methods of anthropological 
investigations to study the home and school as elements of culture. From 
his work came the compelling concept of the “hidden curriculum” in the 
middle-class home. Workers in the field of sociology have long been 
involved in studies of the environment. In a recent review, Cartwright 
(1968) examined the sociologists’ approaches to the problems of “ecological 
variables.” Mitchell (1969) discussed the characterization of environments 
for studies of person-environment interactions in educational research. It 
is only through such environment-centered research that behavioral 
scientists can develop adequate terms to describe the educationally relevant 
attributes of the settings within which human learning occurs. 


The Experimental Setting: Environment for Research 


In addition to the need for increased activity in the characterization 
and measurement of general environments, educational researchers must 
devote attention to one particular kind of environment with which they 
work most frequently—the experimental setting, with special reference 
to the tasks they create for the study of human behavior. 

Most formal education currently involves groups of children studying 
standard school subjects. It seems clear that the classical psychological 
theories of learning and motivation are not capable of explaining or guiding 
those school activities. The reason for this incapacity can be expressed in 
terms analogous to those used to explain the limitations of transfer-of- 
training. To the extent that research is conducted in a setting similar in 
its characteristics to the school situation, to that extent one will get reason- 
able extrapolations from it to the classroom milieu. 


It should be no surprise that the history of behavioral science research 
in education is not particularly glorious. The differences between the human 
learning laboratory and the typical classroom are numerous. The differences 
between the animal learning laboratory and the classroom are far greater. 
Researchers have been all too quick to generalize even from the latter setting 
to classroom behavior. In discussing the inadequacy of psychoanalysis as a 
general personality theory, Bettelheim (1960) cited quite parallel con- 
ditions. He pointed out that psychoanalysis was doomed to failure as a 
general personality theory because all of its generalizations were extrapola- 
tions from that most restricted of experimental settings, the psychoanalytic 
couch. In the same manner, does it not seem presumptuous to expect that 
a learning theory based upon evidence from the T-maze, the pigeon’s 
press-bar or the memory drum can effectively be used to guide the planning 
of that most complex of human endeavors, the typical classroom? This is not 
to deny the future relevance of “conclusion-oriented” inquiry (Cronbach 
and Suppes, 1969) to the conduct of schooling. The present gap between 
such studies and needed educational applications is simply too great. An 
intermediate level of investigation is needed to bridge that gap and create 
the basis for educational theory. 


There is a danger in the overextension of this principle of judging the 
relevance of research settings by their congruence with actual school settings. 
One must not be trapped into always viewing the contemporary configur- 
ation in which pupils, teachers, and schools are found as the necessary 
setting to which experimental results must always transfer. There is no 
ipso facto reason to judge that the current status of schooling is the only 
possible or necessary way to organize education. The educational researcher 
must be prepared to introduce change, not only into the experimental treat- 
ment, but into his conception of the accepted forms of instruction as well. 


Researchers are caught in a bind. To maximize the internal validity of 
experiments, they develop carefully monitored settings within which they 
can govern their research. This has long been recognized as a necessity, 
but it is likely that the experimental tradition in America over-emphasized 
the importance of reliability and precision at the expense of the character- 
istics affecting that other factor of equal importance in the development of 
experimental settings, external validity (Campbell and Stanley, 1963; Wig- 
gins, 1968). It is not sufficient that the individuals studied as a sample are 
truly representative of that human population to which the results of a 
particular experiment will be inferred. Researchers must also ascertain that 
the experimental conditions can serve as a sample from which to make in- 
ferences to a population of external conditions of interest. That is, research- 
ers must also attempt to maximize the similarity between the conditions in 
which they study behavior and those other conditions, whatever they may 
be, to which researchers may ultimately wish to make inferences. The 
similarity should hold between psychologically meaningful features of 
the settings, not merely between the manifest aspects of the two situations. 
Brunswik (1956, p. 39) wrote that “proper sampling of situations 
and problems may in the end be more important than proper sampling of 
subjects, considering the fact that individuals are probably on the whole 
much more alike than are situations among one another.” Bracht and Glass 
(1968) analyzed in detail some of the problems of external validity in 
educational research, distinguishing population validity from ecological 
validity. Their emphasis in the latter category, however, is typically on 
the features associated with the research context, rather than on the specific 
features of the experimental tasks themselves. There is no doubt that such 
factors as Hawthorne Effect, Experimenter Effect, Pretest or Posttest Sensiti- 
zation are important sources of ecological invalidity. I would focus, however, 
on the problem of task validity. Are the actual mental operations or be- 
haviors the subject is called upon to perform in the course of the experi- 
ment reasonably congruent with what takes place in the external domain of 
interest? 


Psychological investigation of verbal learning, concept learning, and 
problem solving—which ought to be most useful to educators—is most 
culpable on these grounds. Within these three areas lie most of the 
objectives of formal education. An excellent example of lack of external 
validity comes from verbal learning research which is shackled to the 
ubiquitous memory drum or to its apparent heir, the tachistoscope-linked 
Carousel slide projector. If by external validity is meant the ability to infer 
the results of studies using a Lafayette memory drum to others using a drum 
by MTA, then little criticism can be leveled against verbal learning investi- 
gators. But if one sees the goals of such research more broadly, as Ebbing- 
haus, James and others among its earliest practitioners did, then the fruits 
of this scholarship must appear quite disappointing. 


If there is a single unit of analysis which distinguishes this domain 
of research (as well as most of the concept learning and problem solving 
literature), it is the trial (Melton, 1963). By any means of analysis, the 
trial must stand as an experimentally created artifact, devoid of the barest 
semblance of external validity. It is remarkably convenient to organize and 
arrange sequentially the material for learning experiments in the form of 
trials, timed or untimed. Such arrangements yield nicely manageable data in 
the form of “trials to criterion,” “average errors per trial,” and the like. 
Even a large field of “metatrialosophy” can develop, which deals with the 
alternative ways of computing or plotting trial data and the complexities of 
trials by subjects designs. But where, in the actual world of human beings 
attempting to learn new material, attain novel concepts or solve unfamiliar 
problems does one find the external analogue of a trial? For example, 
those learning the vocabulary of a second language in a natural setting 
rarely seem to present themselves with a list of new words, limiting their 
practice to timed exposures in unchanging sequence. Most often they learn 
new words in the context of ongoing discourse, either explicitly asking for 
a definition or making a shrewd guess from contextual clues. 

To deal with the discontinuity between the settings of research and of 
educational application, a common language or set of terms for character- 
izing both experimental educational settings and curricula is needed. 
Researchers must seriously strive to develop a means of analyzing the 
characteristics of both experimental and school settings into a complex of 
distinctive features, so the task validity of any particular experiment can 
be estimated in terms of the particular criterion setting to which inferences 
are being made. The distinctive features approach was originally developed 
by Jakobson and Halle (1956) in a study of phonetic systems in language 
and was initially applied to educational problems by Gibson (1968) in her 
studies of reading. Gibson and her associates developed a distinctive features 
analysis for the characteristics of letters or graphemes and demonstrated 
that they could use such an analysis to generate many fruitful hypotheses 
concerning problems in learning to discriminate among letters. 

A distinctive features approach uses a minimum of categories in 
combinations to characterize sets of phenomena. Thus, by using only twelve 
facets, such as grave-acute, lax-diffuse or vocalic-non-vocalic in a variety 
of “bundles,” Jakobson and Halle differentiated among all the phonemes 
of all languages. Hunt (1962) discussed a similar strategy for identifying 
the minimum necessary set of features for specifying a particular concept. 
Barton (1955) described the sociologists’ attempts to define the character- 
istics of social “property space,” an analogous problem. 

I envisage ultimately a situation in which use of such a distinctive 
features approach would allow one to characterize the instructional settings 
to which a particular body of experimental research would most 
effectively be applicable. Conversely, one could begin with a curriculum 
of interest and use such an approach to identify critical experiments that 
might be conducted to examine particular features of the complex curricular 
Gestalt. By specifying the precise degree of overlap between the settings 
under investigation, the relevance of particular studies to particular appli- 
cations could be judged more accurately. 
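
The idea of specifying the degree of overlap between two settings can be sketched in a few lines. What follows is only an illustration under stated assumptions: each setting is reduced to a bundle of binary features, the feature names are hypothetical (they are not drawn from any published feature system), and overlap is taken as the simple proportion of features on which the two settings agree.

```python
def feature_overlap(setting_a, setting_b):
    """Proportion of features on which two settings agree.

    Each setting is a dict mapping a feature name to True/False.
    Features specified for only one setting count as disagreements.
    """
    features = set(setting_a) | set(setting_b)
    if not features:
        return 0.0
    agreements = sum(
        1
        for f in features
        if f in setting_a and f in setting_b and setting_a[f] == setting_b[f]
    )
    return agreements / len(features)


# Hypothetical feature bundles, invented for illustration only.
memory_drum_experiment = {
    "timed_exposure": True,
    "fixed_sequence": True,
    "material_in_context": False,
    "learner_can_ask": False,
}
natural_vocabulary_learning = {
    "timed_exposure": False,
    "fixed_sequence": False,
    "material_in_context": True,
    "learner_can_ask": True,
}
reading_in_context_experiment = {
    "timed_exposure": False,
    "fixed_sequence": False,
    "material_in_context": True,
    "learner_can_ask": False,
}

drum_overlap = feature_overlap(memory_drum_experiment, natural_vocabulary_learning)
context_overlap = feature_overlap(reading_in_context_experiment, natural_vocabulary_learning)
print(drum_overlap, context_overlap)
```

On these invented bundles the memory-drum experiment shares none of the four features with the criterion setting, while the in-context experiment shares three of four. An equal weighting of features is itself an assumption; a serious application would have to establish which features are psychologically meaningful.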

Such research might also stimulate progress in related areas. Currently 
there exists a great deal of argument about the validity of aptitude and 
achievement tests for disadvantaged populations. These debates usually 
focus on whether schools in the inner-city ought to rid themselves of all 
forms of standardized testing. Similarly, the question of using standard 
entrance examinations for disadvantaged college applicants is of major 
contemporary importance (Kendrick and Thomas, 1970). 

There is no uniform response to such concerns. Clearly, a monolithic 
judgment cannot be made on the validity of educational tests. Like ex- 
periments, tests have internal and external validity. When one assesses a 
test’s reliability, he is measuring its internal validity. When he examines 
what the test maker calls validity, he is looking at what the experimenter 
calls external validity. And like the experiment, the external validity of a 
test must be judged on at least two bases: population and task validity. 


It is possible for members of two subgroups to receive the same average 
score on a standardized test but for this score to be predictive of very 
different consequences for each group. Thus the similarity of scores may 
mask more fundamental underlying differences. The validity of a test 
score is not only contingent upon the correlation which scores on that test 
bear with some criterion in the general population. A generally valid test 
may be differentially valid for different subgroups within the general 
population. Thus, when questioning the validity of a particular test, one 
must always ask, “valid for whom,” (population validity) as well as “valid 
for what?” (task validity). 


In some cases, cross-validation for new populations and settings will 
confirm the validity of measures that had been held suspect. Thus, Stanley 
and Porter (1967) demonstrated that the Scholastic Aptitude Test (SAT) 
was as valid a predictor of academic performance for black students in 
Southern Negro colleges as it was for white students. Conversely, Shulman 
(1968a) reported that among adolescents classified as mentally handicapped, 
the Stanford-Binet is valid as a predictor of vocational competence only for 
a white middle-class group but not for a black lower-class sample. 


By combining both concepts of external test validity, one can generate 
a “validity matrix” for any test of interest. Such a matrix would cross 
populations and criteria, reporting the empirically demonstrated validity 
coefficients for the joint occurrence of a population with a particular set of 
distinctive features and categories of criterion tasks with their own sets of 
distinctive features. Whenever a new test is standardized or an old one is 
restandardized, educators should insist on a standardization design which 
will yield a validity matrix for that test, rather than a single, generally 
uninterpretable, validity coefficient. Such an approach would also be 
consistent with the orientation of the growing body of research on aptitude- 
treatment interactions in learning (Cronbach and Snow, 1969). 
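
A validity matrix of this kind can be sketched computationally. The example below is a minimal illustration, not a standardization procedure: the record layout, the population labels, the criterion names and all of the scores are invented, and the validity coefficient is assumed to be a simple Pearson correlation between test score and criterion within each population.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def validity_matrix(records, criteria):
    """Cross populations with criteria.

    records: list of dicts with keys "population", "test",
             and one key per criterion name.
    Returns {population: {criterion: validity coefficient}}.
    """
    by_population = {}
    for rec in records:
        by_population.setdefault(rec["population"], []).append(rec)
    matrix = {}
    for population, rows in by_population.items():
        scores = [row["test"] for row in rows]
        matrix[population] = {
            c: pearson(scores, [row[c] for row in rows]) for c in criteria
        }
    return matrix


# Invented data for two hypothetical populations and two hypothetical criteria.
records = [
    {"population": "A", "test": 90, "grades": 2.0, "job_rating": 3},
    {"population": "A", "test": 100, "grades": 3.0, "job_rating": 5},
    {"population": "A", "test": 110, "grades": 4.0, "job_rating": 4},
    {"population": "B", "test": 90, "grades": 3.0, "job_rating": 5},
    {"population": "B", "test": 100, "grades": 2.0, "job_rating": 4},
    {"population": "B", "test": 110, "grades": 4.0, "job_rating": 3},
]

matrix = validity_matrix(records, ["grades", "job_rating"])
for population, row in matrix.items():
    print(population, {c: round(r, 2) for c, r in row.items()})
```

Each cell answers both questions at once: valid for whom (the row) and valid for what (the column). With samples this small the coefficients mean nothing in themselves; the point is only the shape of the result, a table of coefficients rather than a single number.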


It should be apparent that effectively studying the distinctive features 
of any settings—experimental treatments, curricula or tests—requires a 
carefully developed body of research on the nature of environments per se. 


The educational significance of environment has become a focus 
of scholarly attention in the light of the most recent reformulations 
of the nature-nurture issue (Jensen, 1969). Before I examine the problems 
of weaving the data on environments into the fabric of externally valid 
educational investigations, a brief analysis of the implications of the nature- 
nurture issue for the study of environments may be in order. 

Environment and Heredity 


Stimulated by recent controversies, too many educators have assumed 
extreme positions with regard to the nature and sources of the environment’s 
impact on the educability of the child. Some lean in the direction of 
treating young pupils as examples of the classical tabula rasa, the clean 
slate of epistemology, upon which anything they write with the stylus of 
instruction should cut deeply and without interference. Others insist that 
educators view most of the variability among students in terms of inherited 
individual differences. 


The clear and erroneous implication of the above polarity is that 
declaring a developing characteristic (such as intelligence) “environmental” 
testifies to its infinite plasticity while labeling it “inherited” stains it 
eternally as fixed and immutable. 


The recent period has seen the emergence of a new body of literature 
dealing with such matters, punctuated by Jensen’s (1969) provocative 
assertion that “compensatory education has been tried, and apparently it 
has failed.” (For responses to Jensen’s paper see especially Cronbach, 1969; 
Hunt, 1969; Light and Smith, 1969; and the issues of the Harvard Educa- 
tional Review in which they appear.) Once the unfortunate, if inevitable, 
polemics and recriminations are eliminated, a number of fundamental 
misunderstandings remain which contribute greatly to the confusion on this 
issue. My argument in this section grows from the following propositions: 


(a) There are no grounds for asserting that when the variance in the 
development of a characteristic is attributable mainly to 
heredity, it is therefore not amenable to change via environmental 
intervention. 

(b) There are no grounds for asserting that when the variance in the 
development of a characteristic is attributable mainly to environment, 
it is therefore a relatively straightforward and simple 
task, given enough time and resources, to modify that characteristic 
via environmental intervention. 


When an investigator has wished to demonstrate the effects of heredity 
on the development of a trait such as intelligence, he has generally looked 
to the data on heritability. Such demonstrations are many. The correlation 
between the IQs of monozygotic twins is much higher than between 
dizygotic twins, who are essentially not distinguishable from siblings. The 
correlation between the IQs of foster children and their natural mothers 
is always higher than it is with their foster mothers. Roberts (1952) 
demonstrated the strong effect of genetic factors in the analysis of data on 
the contrasting IQs of the siblings of the severely and mildly retarded. 
Such observations say nothing about the changes possible in measured 
intelligence. They assert only that a substantial portion of the variability of 
intelligence scores can be attributed to heredity. 

Often the same studies are approached by those supporting a strong 
environmental hypothesis, and what appears to be an utterly contradictory 
conclusion is reached. They choose to ignore the correlational findings 
which the first group finds so seductive. Instead, they compare the mean 
IQs of twins raised in different environments, emphasizing how different 
the scores are from each other. They also take pains to show that, 
correlations notwithstanding, the actual IQs of foster children are generally more 
similar to those of the foster parents than to those of the natural mothers. 


What is important here is that both conclusions are drawn reasonably 
from the data and both are tenable. High correlations can hold between 
two sets of scores while the mean values of the two sets are highly disparate. 
Although heredity appears to make a major contribution to the relative 
rank-ordering of intelligence scores with environment held constant, it is 
the degree of abundance or deprivation of the environment that seems to 
contribute most to the attained IQ status of individuals. 
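
The statistical point, that a high correlation between two sets of scores is compatible with a large difference between their means, can be verified with a small numerical sketch. The scores below are invented solely for illustration and are not drawn from any twin or foster-child study; the second set is simply the first shifted upward by a constant.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented scores: the second set preserves the rank order of the first
# but every score is 20 points higher.
first_set = [85, 90, 95, 100, 105]
second_set = [score + 20 for score in first_set]

r = pearson(first_set, second_set)
mean_gap = mean(second_set) - mean(first_set)
print(r, mean_gap)
```

The correlation is perfect because the rank ordering and spacing are preserved, yet the two sets differ by 20 points on average. A correlation coefficient is blind to a uniform shift, which is why the correlational and the mean-difference readings of the same data need not conflict.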


Much of the heat in the nature-nurture controversy is a function 
of the assumption that a nurtured characteristic is easily changed, while 
a “natured” one is stubbornly resistant to modifications. This is simply not 
the case. 

Research in psychotherapy has found emotional disorders, most likely 
environmental in etiology, remarkably difficult to ameliorate. The cure-rate 
for infantile autism, which at least some theorists (Bettelheim, 1967) insist 
is learned, is depressingly low. Conversely, the effects of phenylketonuria 
(PKU), an inherited disorder, can be ameliorated by careful control of 
diet. Certain inherited hemolytic anemias, such as spherocytosis, are brought 
under control through chemotherapeutic or surgical procedures. No facile 
generalizations can be made concerning the relationship between the 
sources of variability underlying a trait and the changes that can be 
brought about in its attained status. 


Finally, we must keep in mind the important differences between the 
data on IQ, a variable of primarily theoretical interest, and school achievement, 
a variable of great practical significance. All studies show school 
achievement with a significantly lower heritability than IQ (Jensen, 1969). 
School achievement also stabilizes later than IQ, thus remaining sensitive 
to environmental change for a longer period (Bloom, 1964). 


Thus, an understanding of the ways environment influences human 
growth and functioning remains a critical domain of inquiry for the 
educator, whatever his position on nature, nurture and intelligence. Until 
the gene is as manipulable as the printed word, the educator's only tool 
will be the modifiable environment. It is the environment which he must 
manipulate in his studies and from which he fashions the settings of formal 
instruction. 


The study of environments is only one aspect of the needed reconstruc- 
tion of educational research. In the next section, I discuss a broader 
conception of desirable general directions for educational research strategy. 


Reconstruction of Research Strategy 


Research in education will have to venture forth from the safe and 
sterile surrounding of the traditional laboratory and address itself to that 
most threatening of settings for the educational researcher, the classroom 
or its carefully created equivalent. Instead of viewing experimental treat- 
ments in terms of single variables, such as “phonetic” versus “whole-word” 
or “discovery” versus “rote,” researchers must begin to contrast total 
educational approaches, e.g., curricula or their parts, whose components 
have been carefully selected and combined. 


Cronbach (1966) advocated that an educational tactic be studied in 
its proper context. This context includes its place in a sequence of other 
tactics which have been combined because they are particularly suited to 
each other. In this sense, the concept of a set of experimental groups which 
are equivalent to each other in all matters save one dimension whose effect 
is being studied, all of which are in turn compared to some “control group,” 
is probably an anachronism. 


A particular educational tactic is part of an instrumental system; 
a proper educational design calls upon that tactic at a certain point 
in the sequence, for a certain period of time, following and pre- 
ceding certain other tactics. No conclusion can be drawn about the 
tactic considered by itself. . . . 


An educational procedure is a system in which the materials 
chosen and the rules governing what the teacher does should be 
in harmony with each other and with the pupil’s qualities. If we 
want to compare the camel with the horse, we compare a good 
horse and a good camel; we don’t take two camels and saw the 
hump off one of them. 
(Cronbach, 1966, pp. 77, 84) 


Because educational programs are far more complex than the present 
psychological theories which purport to explain the teaching-learning 
process, it might be in the long range interest of both psychological theory 
and education to ignore those theories for the moment and proceed along a 
relatively atheoretical path in the study of education. If educators but 
look around, they will see that in contrast to their theoretical impotence, 
they do not lack for ideas and even numerous successes in the teaching of 
many things to a wide variety of children. In fact, were there not a fairly 
large proportion of successful teaching experiences with working class 
children, a strikingly large proportion of those currently active in educa- 
tional research would not be there. 

It is not my intent to collapse the distinction between research and 
evaluation, or between “conclusion-oriented” and “decision-oriented” 
inquiries (Cronbach and Suppes, 1969). I am advocating the pursuit of 
different types of conclusion-oriented inquiry aimed at creating theoretical 
formulations that can be useful in guiding educational thought. 


In the following section I review a number of research strategies for 
education. I have not invented them for this paper, nor do I claim that 
they are totally unique or novel. Other workers in related domains (e.g., 
Thoresen, 1969), have independently arrived at a partially overlapping 
list of such recommendations. The strategies I recommend are generally 
not independent alternative approaches but can be seen to fit into a larger 
over-all strategy of research which I would call, after Schwab (1960), a 
“grand strategy” (Keislar and Shulman, 1966, p. 197). 


Epidemiological Strategies 


The epidemiological strategy derives from the kinds of research 
often conducted in the public health field (Rogers, 1965; Sartwell, 1965). 
Often an epidemic will spread rapidly through an area, affecting some 
parts of the population and leaving others unscathed. An important question 
raised in such a situation is what distinguished those who were susceptible 
to infection from those who were left unharmed. Similarly, in studies of 
such social phenomena as delinquency (Glueck and Glueck, 1968), it 
appears that individuals may come from what is ostensibly a common 
environment, with some turning to crime and others turning to more 
socially acceptable activities. 

Researchers are warranted in inferring that while, in either case, the 
two groups in question may appear to come from the same setting, there 
must be some significant differences between them. Although these two 
groups were not created experimentally, questions can be raised by identify- 
ing representative members of each of the two contrasting groups and by 
attempting to analyze back and discover all of the differences between 
them. From such analysis, hopefully would come working hypotheses 
about kinds of purposeful differences in treatments one might develop 
either to raise the probability of immunity or the development of socially 
acceptable behavior patterns in future individuals. These hypotheses could 
in turn be tested in long-range experimental programs. 

It is apparent that similar kinds of epidemiological strategy can and 
do work in areas such as the study of effective and ineffective teachers, 
successful and unsuccessful readers, and many others. The important thing 
to recognize is that researchers do not confirm hypotheses in such a manner, 
but rather they generate them. Educators have for too long used only 
armchair or laboratory-based theories for generating hypotheses, or the 
residue of many years of intuitive experience, rather than systematically 
gathering careful descriptive data to generate working hypotheses. Such 
epidemiological strategies could be extremely useful as the early stages of a 
complex program of educational research. The critical attribute of such a 
strategy is examining in detail the background variables which distinguish 
effective from ineffective performers in a particular domain for the pur- 
pose of using such discriminators as the basis for creating experimental 
treatment or training programs. 
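
One way such a search for discriminators might begin is sketched below. It is an assumption-laden illustration rather than an established procedure: the two groups, the background variables and every value are invented, and the index used (the difference between group means divided by the standard deviation of all cases combined) is only one of many possible ways to rank variables. In keeping with the argument above, the output is a list of candidate hypotheses, not a confirmation of any of them.

```python
from statistics import mean, pstdev


def rank_discriminators(group_a, group_b):
    """Rank background variables by how sharply they separate two groups.

    Each group is a list of dicts sharing the same variable names.
    Returns (variable, standardized mean difference) pairs, largest
    absolute difference first.
    """
    results = []
    for variable in group_a[0]:
        a_values = [case[variable] for case in group_a]
        b_values = [case[variable] for case in group_b]
        spread = pstdev(a_values + b_values)
        difference = (mean(a_values) - mean(b_values)) / spread if spread else 0.0
        results.append((variable, difference))
    return sorted(results, key=lambda pair: abs(pair[1]), reverse=True)


# Invented background data for hypothetical successful and unsuccessful readers.
successful_readers = [
    {"books_at_home": 40, "class_size": 30},
    {"books_at_home": 60, "class_size": 28},
]
unsuccessful_readers = [
    {"books_at_home": 10, "class_size": 29},
    {"books_at_home": 20, "class_size": 31},
]

ranked = rank_discriminators(successful_readers, unsuccessful_readers)
for variable, difference in ranked:
    print(variable, round(difference, 2))
```

The variables at the top of such a ranking would then become candidates for deliberate manipulation in an experimental treatment, where their effects could actually be tested.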


“Grammars of Behavior” Strategies 


A second strategy for using descriptive research to develop a testable 
set of models for education comes from the work done by field linguists as 
they confront previously unexplored aspects of linguistic systems and attempt 
to write adequate grammars for them. When a linguist confronts the prob- 
lem of writing a new grammar, he generally begins by collecting a large 
corpus of speech from selected informants who are designated a priori as 
“native speakers” of that language. It is then asserted that their capacity to 
perform the operations of speaking and hearing that language rests upon an 
underlying system of rules, of which they are generally not conscious, 
which constitute the grammatical competence of the speaker. The linguist’s 
task, once he has collected his corpus of speech, is to subject the data to a 
series of careful analyses in order to discover and make explicit that under- 
lying rule system, which is known as the grammar of the language (Chom- 
sky, 1965). As in my earlier examples, such formulations must subsequently 
be confirmed through cross-validation. 


Analogous processes have been used in psychology for generating 
rule systems for proving mathematical theorems (Newell, Shaw and Simon, 
1958) and for interpreting protocols of the Minnesota Multiphasic Person- 
ality Inventory (Kleinmuntz, 1968). It appears that similar approaches could 
be developed for studying the processes of education. The strategy would 
involve initially identifying criterial educators, who, like the native speaker 
of the language in linguistic studies, are taken to represent some standard of 
excellence as practitioners of the educational arts. Careful descriptive 
protocols of the criterial educators’ verbal and non-verbal behavior would then be 
gathered and, using behavioral equivalents of the linguistic rule discovery 
tactics, educational researchers would attempt to write a grammar of their 
teaching behaviors. This grammar would be a set of rules adequate to account 
for their functioning. I would hypothesize that once made explicit and cross- 
validated, such rules could be used to develop instructional procedures to 
help new students attain the criterial performer's level of competence. Jen- 
kins (1969) reported that he and his co-workers are attempting similar 
grammatical analyses of the behavior of children attempting to perform 
certain of the Piagetian tasks. My colleagues and I (Shulman, 1968b) are 
also working with such a model for the study of criterial performance among 
medical diagnosticians. 


An important lesson to be learned from experience with this kind 
of strategy as used by psychologists is that one does not limit himself to 
observation of the overt physical behavior of the subject alone. The 
evidence from the studies of general mathematical problem solving by 
Newell, Shaw and Simon, of interpretation of the MMPI by Kleinmuntz, 
and of chess playing by De Groot (1965; 1966) has made it abundantly 
clear that psychology’s long dormant interest in introspective data must 
be reawakened. The most effective protocols gathered by these investiga- 
tors included reports of not only what the subjects did, but also what they 
said about what they were doing, while doing it. Although I recognize the 
admonitions concerning introspection raised recently by Hebb (1969), I 
must insist that the evil reputation which introspection has received is 
more a function of its misuse by such distinguished figures as Titchener, 
than it is a function of some intrinsic insufficiency of the approach itself. 


Simulation in Research 


I indicated above the need to increase the task validity of experiments 
by maximizing the similarity between the features of the experimental tasks 
and those of the ultimate transfer settings to which the research findings 
were to be inferred. Techniques of simulation can be extremely useful in 
creating such research settings. Here simulation does not refer to the 
use of games in teaching (Boocock and Schild, 1968) nor to the use of 
computers to mimic men or systems (Kleinmuntz, 1968). An investigator 
using simulation attempts to create an artificial environment that resembles 
the actual environment of interest as closely as possible. However, careful 
control over every input into the simulation is maintained and elements 
of the situation can be experimentally manipulated as needed. Thus, the 
experimental simulation can serve as the ideal middle ground between the 
artificiality of the typical experimental laboratory and the totally uncon- 
trolled research environment of the behavioral ecologist studying organisms 
in their natural habitats. 


Simulation can be used to create settings for studying widely diverse 
phenomena in education (Twelker, 1969) as well as specific cognitive 
functions in realistic situations (Shulman, Loupe and Piper, 1968). 
Simulation approaches afford a means of taking the working hypotheses generated 
from the descriptive tactics of epidemiological or behavioral grammar studies 
and testing them at an intermediate level between the laboratory and the 
field. 


Multivariate Experimental-Longitudinal Strategies 


Most conceivable schooling situations will possess certain common 
characteristics: (1) They involve the attempt to modify or manipulate a 
setting (with or without teacher) to bring about desired changes in a 
learner; (2) They take place over relatively extended periods of time; 
(3) They involve the simultaneous input of multiple influences and the 
likely output of multiple consequences—some predicted, others not; and 
(4) They are characterized by variability of reaction to ostensibly com- 
mon stimuli, that is, not all learners learn equally or react similarly to 
specific acts of teaching. 


What is described above is a highly complex and variegated activity, 
involving students, subject-matter and sources of instruction in an edu- 
cational setting. Any research which purports to deal systematically with 
phenomena at this level of complexity must itself reflect an appropriate 
level of complexity. The ideal research setting, to be congruent with the 
description of the educational setting offered above, must be (1) experi- 
mental; (2) longitudinal; (3) multivariate at the level of both independent 
and dependent variables, and consistent with that, (4) differential, in that 
the interactions of the experimental programs with the students’ entering 
individual differences are treated not as error variance, but as data of 
major interest in the research. 
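
One way to make requirement (4) concrete is to compute the treatment-by-aptitude interaction directly instead of pooling it into error. The following Python sketch is only an illustration under stated assumptions: the function names, the two-level aptitude split, and the outcome scores are hypothetical and are not drawn from any study cited here.

```python
from statistics import mean


def cell_means(records):
    """Average the outcome within each (treatment, aptitude level) cell.

    records: iterable of (treatment, aptitude_level, outcome) tuples.
    """
    cells = {}
    for treatment, level, outcome in records:
        cells.setdefault((treatment, level), []).append(outcome)
    return {cell: mean(scores) for cell, scores in cells.items()}


def interaction_contrast(means, treatments, levels):
    """Treatment effect at the first level minus the effect at the second.

    A value near zero suggests the program works about equally well for
    both kinds of learner; a large value is the differential result that
    the text asks researchers to treat as data rather than as error.
    """
    first_treatment, second_treatment = treatments
    first_level, second_level = levels
    effect_first = means[(first_treatment, first_level)] - means[(second_treatment, first_level)]
    effect_second = means[(first_treatment, second_level)] - means[(second_treatment, second_level)]
    return effect_first - effect_second


# Hypothetical posttest scores, invented purely for illustration.
hypothetical_records = [
    ("program", "low", 12), ("program", "low", 14),
    ("control", "low", 8), ("control", "low", 10),
    ("program", "high", 15), ("program", "high", 15),
    ("control", "high", 15), ("control", "high", 17),
]

means = cell_means(hypothetical_records)
contrast = interaction_contrast(means, ("program", "control"), ("low", "high"))
```

With these invented scores the program helps low-aptitude pupils by 4 points and slightly hurts high-aptitude pupils, so the contrast of 5 is the finding of interest. A full analysis would also estimate how stable that contrast is.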

In experimental-longitudinal studies the long-term effects of con- 
tinuing educational programs are examined (Carroll, 1965). As he does 
with experiments in general, the researcher attempts to equate the groups of 
subjects at the beginning of the study either through simple or stratified 
randomization. These are unlike the typical experiment because the 
researcher looks at the cumulative effects of ongoing programs rather than 
at the one-shot effects of a single exposure to instruction. Hence, the 
educational program is something that continues for a duration of months 
or years. The evaluation of these programs also is continuous during the 
full course of the investigation, rather than limited to the program’s termi- 
nation. The criterion variables are chosen to cover as wide a range of 
relevant behaviors as possible. This longitudinal quality characterizes cur- 
rent studies of computer-managed instruction (Cooley and Glaser, 1969) 
and computer-assisted instruction (Atkinson, 1968). For this reason alone, 
these studies are likely to prove educationally relevant, whether or not 
computer hardware is ever seriously used in schools. 


A number of tactics are possible within the general experimental- 
longitudinal framework. Instead of beginning the entire study at a single 
point in time, the researcher can stagger the onset of the program over a 
period of months or years, with different experimental groups entering the 
treatment phase at different stages. In a situation in which the entire popu- 
lation must receive a program for political or social reasons, but justification 
can be made for a gradual implementation, the groups beginning the 
program later can serve as functional controls until their turn comes 
(Campbell, 1969). Also, a staggered longitudinal design has a built-in 
replication mechanism in the groups who begin later. These groups can also 
serve as controls for the possible effect of historical variables on the 
research results. Finally, a staggered design provides the opportunity for 
systematic tinkering with programs in midstream without washing out the 
entire study. In this sense, the problems dealt with in this research strategy 
become similar to those encountered when evaluating on-going school 
programs (Provus, 1969; Brickell, 1969). 
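
The staggered tactic described above can be sketched as a simple schedule in which each group waits as a control until its own onset period. This Python sketch rests on assumptions made only for illustration: the group labels, the number of periods, and the one-period stagger are hypothetical, and the order of onset is taken as already decided (for example, by random draw).

```python
def staggered_schedule(groups, n_periods, stagger=1):
    """Return {group: [status per period]} for a staggered-onset design.

    The i-th group in `groups` begins treatment at period i * stagger and
    is labeled "control" in every period before that.
    """
    if n_periods <= 0 or stagger <= 0:
        raise ValueError("n_periods and stagger must be positive")
    schedule = {}
    for index, group in enumerate(groups):
        onset = index * stagger
        schedule[group] = [
            "treatment" if period >= onset else "control"
            for period in range(n_periods)
        ]
    return schedule


def controls_at(schedule, period):
    """List the groups still serving as functional controls in a period."""
    return [group for group, statuses in schedule.items()
            if statuses[period] == "control"]


# Three hypothetical groups followed over four periods.
plan = staggered_schedule(["A", "B", "C"], n_periods=4)
```

In this plan groups B and C are controls for group A during the first period, group C alone is a control during the second, and every group is in treatment from the third period on, so each later group also replicates the earlier ones.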

In general, such research can and should use the techniques of 
experimental studies whenever possible. For example, while randomization 
is often a major problem in dealing with the assignment of individual 
students to program components, it can often be more easily achieved if the 
unit of analysis is classrooms or schools. That is, it is sometimes easier to 
randomly assign classrooms or schools than it is to assign pupils (Campbell, 


Testing the hypotheses in experimental-longitudinal designs need not, 
however, stand or fall on the availability of random assignment and formal 
control or comparison groups. Approaches to the employment of correlation 
or regression-based techniques for the assessment of change in non- or 
quasi-experimental settings are increasing in usefulness and availability 
(Campbell, 1963, 1967; Campbell, Ross and Glass, in press; Yee and Gage, 
1968; Land, 1968). 


Designs of the proposed experimental-longitudinal genre will be 
multivariate in their independent and dependent variables. Cronbach 
(1966) pointed out the need for a broad spectrum of outcome measures 
in research on learning by discovery. His recommendations hold equally 
well for research in other learning areas. Campbell (1969) made a similar 
point for quasi-experimental studies in the public policy domain. 


Cattell (1966b) argued that behavioral science research, previously 
moribund and sterile, will make visible progress only when it shakes off 
the constraints of bivariate studies and begins to deal with a multivariate 
universe in a multivariate way. Naturally, Cattell believes that the 
multivariate experimental techniques reviewed in his imposing handbook 
(Cattell, 1966a) hold the methodological keys to conducting such investigations. 
It is a bit ironic that I recommend this family of techniques to my colleagues 
in education, since Cattell observes that many of these, such as factor 
analysis and multivariate analysis of variance, were initially employed for 
solving educational or psychometric research problems. 


When conducting such studies, researchers must not ignore the teacher 
as an independent variable. Stephens (1967), in a compelling and disturbing 
review of research, argued that no systematic effects for curriculum or 
program are observable in fifty years of educational research. He concluded 
that what determines the effectiveness of a program is that variable generally 
ignored or ostensibly randomized away—the teacher. Instead of pretending 
that teachers are merely sources of error variance, researchers must use 
multivariate experimental designs which include teachers’ educationally 
significant characteristics as factors. At the present time, unfortunately, 
researchers have no idea what those characteristics are. Epidemiological or 
behavioral grammar studies may be useful in identifying those features. 

In summary, researchers have fallen into the habit of conducting 
educational research in the classical two-group experimental mode or in the 
simple correlation descriptive mode. Examination of the true state of 
educational problems reveals that such approaches are extremely short on 
educationally relevant external validity. The tools and settings exist, how- 
ever, for significantly improving such research through use of multivariate 
experimental-longitudinal designs of many kinds. 


Replication 


The time has arrived for educational researchers to divest themselves 
of the yoke of statistical hypothesis testing and to assign the test of significance 
to its proper role. Tukey (1969, p. 85) observed that the use of statistical 
significance testing was never meant to serve as a substitute for replication: 


The modern test of significance, before which so many editors 
of psychological journals are reported to bow down, owes more to 
R. A. Fisher than to any other man. Yet Sir Ronald’s standard of 
firm knowledge was not one very extremely significant result, but 
rather the ability to repeatedly get results significant at 5%. 

Repetition is the basis for judging variability and significance 
and confidence. Repetition of results, each significant, is the basis, 
according to Fisher, of scientific truth. 


I would recommend the following general strategy for establishing the 
educational significance of an experiment: (1) Identify the magnitude of 
the treatment effect that will be considered significant in a particular study; 
(“To be meaningful, this tactic should lead to an increase of at least two 
reading levels within 6 months.”) (2) Establish the proportion of subjects 
who must achieve the desired magnitude of change for the results to be 
considered significant. (“At least 60% of the pupils should achieve the 
meaningful level of change.”) (3) Report the findings in terms of the 
proportion of subjects who actually do achieve the desired magnitude of 
change, as well as in terms of the usual measures of mean difference. Apply 
techniques analogous to statistical significance testing at this stage to 
answer questions of inferential stability. In such studies, these “tests” will 
resemble estimation procedures such as confidence intervals, much more 
than classical significance testing. (4) Employ a wide range of entering 
individual difference measures to provide a basis for generating subsequently 
verifiable hypotheses about the characteristics of those subjects who profit 
especially from the treatment, and those who do not. (5) Employ a broader 
range of criterion variables than would be suggested directly by the inde- 
pendent variable alone. The greatest value of a program could turn out to 
be in its effect on categories of change never considered in its initial 
development (Campbell, 1969; Schwab, in press). (6) Whatever the find- 
ings, replicate them. Make replication as integral a part of the research 
designs as posttesting or data analysis. Researchers are so unused to 
conducting replications that many of their current conceptions of replica- 
tion turn out, after critical scrutiny, to be disturbingly naive and simplistic. 
Cronbach’s (1968) discussion of the problems of replicating findings on 
creativity and intelligence at different age levels illustrates these problems 
well. 
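
Steps (1) through (3) of this strategy can be sketched in a few lines of Python. What follows is a minimal illustration, not a prescribed procedure: the function names and gain scores are hypothetical, and the Wilson score interval is one assumed choice among several estimation procedures that would fit the description. The two-level gain and the 60% criterion echo the examples quoted in the text.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


def educational_significance(gains, min_gain, required_proportion):
    """Report the share of subjects reaching min_gain and its interval."""
    if not gains:
        raise ValueError("gains must not be empty")
    n = len(gains)
    successes = sum(1 for gain in gains if gain >= min_gain)
    observed = successes / n
    return {
        "observed_proportion": observed,
        "interval": wilson_interval(successes, n),
        "meets_criterion": observed >= required_proportion,
    }


# Hypothetical gains in reading levels over six months (invented data).
hypothetical_gains = [2.5, 1.0, 2.0, 3.0, 0.5, 2.2, 2.0, 1.8, 2.4, 2.1]
report = educational_significance(
    hypothetical_gains, min_gain=2.0, required_proportion=0.60
)
```

Here 7 of the 10 invented pupils reach the two-level gain, so the observed proportion exceeds the 60% criterion. The interval, however, runs from roughly 0.40 to 0.89, which shows why a single small study settles little and why step (6), replication, matters.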


Closing the Method Gap 


Cattell (1966a) observed that progress in science is almost always 
presaged by methodological innovations or breakthroughs. He cites as 
examples the telescope, microscope and factor analysis, among others. Al- 
though Kuhn (1962) would doubtless remind us of the frequency with 
which new paradigms are created through empirically stimulated theoretical 
rather than methodological reorientations, Cattell’s point remains a com- 
pelling one. It also raises an important question. 


The present era is one of significant methodological progress in the 
behavioral sciences and education. The development of new techniques, 
especially in the multivariate domain, proceeds at a rate which dazzles the 
non-specialist, even though in the eyes of the educational statistician most 
of the “new developments” are merely variations on a few major themes. 
Why then do researchers not observe the application of such techniques in 
the conduct of substantive research? If anything, the gap between the 
methodologist and substantive researcher in education has grown wider 
in the past decade. Why has not this undeniable progress in research 
techniques given birth to the badly needed end of educational research’s 
Dark Age? 

The microscope, telescope, and factor analysis were not created by men 
who considered themselves primarily to be methodologists. They were 
scholars confronted by compelling research problems for which their extant 
methods could provide no solutions. Hence, from substantive puzzlement 
grew methodological innovation. For all practical purposes, there were not 
two groups—the methodologists and the substantive researchers. There were 
only researchers, striving to develop methods appropriate to solve the key 
problems confronting their disciplines. 

In the present era this has changed. Educationists currently train 
“research design specialists” almost independently of “learning and develop- 
ment men.” The disaffection is two-way. Not only do the methodologists 
tend to disdain the necessity of gaining understanding in the substantive 
domains; the substantive researchers react defensively by denying the im- 
portance of methodological advances and by leading a generalized retreat 
to Chi Square. 

It is rather ironic that most medical schools require at least one year 
of college mathematics for their entering medical students, although the 
typical medical curriculum makes woefully little use of any mathematical 
expertise. Yet, those in educational research, a domain which is becoming 
increasingly mathematics-involved, generally require no mathematics pre- 
requisites for graduate students in areas like educational psychology. 

Educational research training programs and many institutional organ- 
izations perpetuate this method gap. Unless it is breached, educational 
researchers are not likely to become capable of conducting the kinds of 
externally valid experiments discussed earlier in this paper. 


The Organization of Research 


The need for conducting educational research on a multivariate and 
longitudinal scale has been recognized by a number of scholars in education. 
These men suggested many specific strategies for dealing with the problems 
created by such research (Carroll, 1965; Thoresen, 1969; Baker, 1967). 
Studies of this magnitude are in vivid contrast to the type most often 
reinforced in academic circles. These latter studies are usually short, quick, 
and speedily analyzed. The experimental treatments can be administered 
in a matter of minutes or hours and the results are assessed immediately 
thereafter. They are as unlike the form of experiment I have been discussing 
as they are unlike anything that happens to children in real classrooms. 
Yet, on the grounds that the greater complexity of classroom activities 
constitutes no more than a concatenation of these more simple operations 
sequentially linked together to form a curriculum, the argument is made 
that such strategies of research are justified for educational investigations. 

Researchers must recognize, however, that there is also a network of 
institutions which work to reinforce this approach to research. As long as 
the academic setting is one where the patterns of reinforcement are con- 
tingent upon the number of articles published by an investigator, rather 
than the relevance and quality of his investigations; as long as the educa- 
tional researcher is encouraged to operate as an individual entrepreneur, 
rather than as part of a research team; as long as the overwhelming 
proportion of studies conducted in education are one-shot doctoral disserta- 
tions designed to collect the most data in the shortest time, educational 
researchers will continue to produce research which is of little value to 
developing educational theory. What is needed is more than a change in 
the way in which the next generation of researchers is trained. A revolution 
is needed both in the structure of the research enterprise and in the kinds of 
criteria utilized by administrators who judge the quality of academic 
performance and parcel out the subsequent monetary and status rewards. 

Tukey (1969) made the observation that research in psychology 
continues to operate as if the individual Ph.D. thesis served as the prototype 
for scholarly efforts in the field. Education suffers from the same malady. 
Tukey commented, “Other sciences have faced the transition from the Ph.D. 
thesis that stood on its own feet to the Ph.D. thesis that is part of a bigger 
entity. No Ph.D. builds his own cyclotron as part of his thesis. No Ph.D. 
orbits his own satellite to get his data” (Tukey, 1969, p. 88). He 
recommended increased cooperative efforts in psychological research, with Ph.D. 
theses planned as part of broader general programs of research engaged in 
collectively by their thesis supervisors. 

Education must echo Tukey’s call for reorganization of the structure 
of the educational research enterprise, leaving behind the model of the 
individual entrepreneur and replacing it with the coordinated, 
institutionally-supported research team. (Are we now to hear cries against 
“socialized educational research”?) My experience with several R & D Centers 
suggests that too often these ostensibly team operations are merely fiscal 
umbrellas facilitating the same kinds of individual and independent investi- 
gation conducted unprofitably for years. Perhaps it should be no surprise 
that the research “rockets” launched individually so often abort. 


Summing Up 


Has educational research been tried? Has it failed? Surely a kind of 
educational research has been attempted without notable success. But from 
its often clumsy gropings (or its ever-so-precise figure eights, carefully 
retracing well-worn patterns) have emerged general strategies and 
approaches that could show promise for the future. 


Educators need not feel uniquely culpable because of the necessity 
to make dramatic shifts in their most basic tactics of investigation. Even 
the hitherto impregnable fortress of experimental psychology is currently 
being shaken by that most devastating of saboteurs—the critic from within. 
Such eminent experimentalists as James Jenkins (1968) and James Deese 
(1969) are challenging not only the usefulness of classical S-R behavior 
theory, but the very behavioral emphasis of psychology itself and the future 
fruitfulness to psychology of the traditional methods of experimentation. 
That both Deese and Jenkins were nominated in 1969 for the presidency of 
the Division of Experimental Psychology of the American Psychological 
Association makes the situation doubly ironic. In all of the behavioral 
sciences, well-worn paradigms are being called into question as accomplish- 
ment falls far short of promise (Cattell, 1969a). Educational researchers 
are simply sharing in the long overdue discomfort of their parent and sister 
disciplines. 

The solutions to these difficulties will not arise from a mere reshuffling 
of the “in” research designs. The approaches employed for training the 
next generation of educational researchers must change radically. Even more 
important, the structure of the educational research establishment must be 
significantly modified to create the necessary conditions for research in 
education. 


EFFECTS OF PICTURES ON LEARNING TO 
READ, COMPREHENSION AND ATTITUDES 

University of Minnesota 


If fish were to become scientists, the last thing they might 
discover would be water. Similarly, researchers have too often failed to 
investigate important aspects of their environment because being immersed 
in it, they fail to notice certain components of it; or, having noticed a 
component, they simply assume that it must be that way. One such example 
from reading is the ubiquitous use of illustrations in books for beginning 
reading instruction. Today nearly all children are taught to read from 
books containing pictures. In this country this practice goes back at least 
to 1729 when the New England Primer incorporated pictures with the 
text. In Europe during the 1650's, Comenius used pictures in his Orbis 
Pictus to teach reading. Only occasionally does one find a writer (Dechant, 
1964) questioning the use of pictures or a published reading series, such as 
the Bloomfield and Barnhart (1963) readers, attempting to teach reading 
without pictures. Teachers usually accept the fact that beginning reading 
texts have pictures without wondering what effect pictures have on learning 
to read. 


In its yearly summary of investigations related to reading, the Reading 
Research Quarterly reports nearly 400 studies. Surprisingly, only a fraction 
of the studies were concerned with pictures or illustrations. This review is a 
summary of those studies in which researchers have investigated the effects 
of pictures on (a) learning to read, (b) comprehension, and (c) attitudes. 
In addition, specific suggestions for needed research are included. To 
delineate the area for review, preference was given to those studies in which 
pictures were used as adjuncts. Pictures are considered adjuncts if the text 
can be comprehended or the objects of the lesson fulfilled when the pictures 
are removed. 


The effect of pictures on learning to read words 


Teachers of beginning reading use the pictures in primers to build 
background for the story, to introduce the meaning of new words to be 
learned, and as prompts when children can not recognize printed words. 
In the last instance (when a student is unable to identify a word), the 
teacher often suggests that the student look at the picture for a cue to the 
identification of the word. Although pictures may be used as prompts when 
the student cannot recognize a word in the text, pictures may miscue and 
divert attention from the printed word. 


REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 3 


To determine the effect of pictures on learning to read words, Samuels 
(1967) conducted a laboratory and a classroom study. In the laboratory 
study, two groups of randomly-assigned kindergarten children learned to 
read four words (boy, bed, man, car) either with a picture or without a 
picture. For learning trials in the picture condition, a relevant picture was 
placed directly above the printed word on the card used for instruction. For 
example, the word boy was at the bottom of the card and above it was a 
picture of a boy. In the no-picture condition, only the printed word 
appeared on the card. The task for the subjects in both conditions was to 
learn the appropriate oral response associated with the printed stimulus. 
Feedback was given during learning trials. Each learning trial was fol- 
lowed by a test trial. On the test trials only the printed stimulus was on 
the card, and no feedback was given. 


The results of the laboratory study indicated that on the learning 
trials, significantly more correct responses were given by the group to 
which a picture was presented. However, on the test trials when pictures 
were not presented as prompts, the no-picture group gave significantly more 
correct responses. 


In the laboratory study, the investigator tested the effect of pictures 
on naive subjects under conditions unlike those found in classrooms. Since 
laboratory studies are often subject to the criticism that they could not 
be replicated in less artificial situations, the classroom study was done using 
first-graders with seven months of reading instruction and with a procedure 
similar to that used in the classroom. 


Special reading material was prepared for the classroom study (Samuels, 
1967). A story titled “Fun at Blue Lake” with fifty different words was 
printed and attached to the right face of a book cover. For the picture 
condition, the left face of the cover had a picture showing a lake, forest, 
cabin, and people at the shore. In the no-picture condition, the left face of 
the book was blank. Subjects for the study were individually pretested, 
matched according to the pretest score, and randomly assigned to a picture 
or to a no-picture condition. 


To ensure that the teaching procedure was the same for the picture 
and no-picture conditions, subjects in both groups were instructed at the 
same time by the same teacher. The only difference was the presence or 
absence of pictures. The instruction followed recommended procedures: 
motivation and building background for the story, silent reading, and oral 
reading. Immediately following reading instruction, the posttest was given. 
The analysis indicated that there was no difference in learning between 
the picture and no-picture condition for the better readers. Among the 
poorer readers, those in the no-picture condition learned significantly more 
words. Post-experimental interviews with the teachers indicated that during 
the first five months of reading instruction, the teachers had encouraged 
their students to use the pictures as cues to word identification. However, for 
two months prior to this study, the teachers had instructed their students 
to ignore the pictures. As the data indicated, pictures had no effect on the 
better readers, but among the poor readers, the presence of pictures inter- 
fered with learning a sight vocabulary. Although this study was not 
designed to give evidence of how students were using the pictures, it is 
probable that the less able readers were using the strategy first suggested 
by their teachers, that is, looking to the picture for a cue when having 
difficulty identifying a word. 


The finding that pictures interfered with the reading attainment of 
poorer students, but not with better students, is especially interesting in light 
of the studies by Silverman, Davids, and Andrews (1963) and by Baker 
and Madell (1965). They found that when distracting stimuli were present, 
the performance of underachievers suffered greater disruption than did the 
performance of more capable students. 


Two additional studies bear directly on the problem of the effect of 
pictures on learning a sight vocabulary. Braun (1969) tested the view 
commonly held among educators that multi-sensory stimulation (e.g., 
pictures and words presented together) results in better learning than does 
single-sensory stimulation. To test this assumption, Braun randomly 
assigned 240 kindergarteners either to a picture or no-picture condition; 
learning and test trials were alternated. In comparing the two conditions, 
Braun found that subjects in the no-picture condition acquired the sight 
vocabulary significantly faster than subjects in the picture condition in 
seven of eight comparisons. 


The Harris (1967) dissertation was similar in design and procedure to 
the Braun study; the major difference was that subjects in the Harris study 
were from a low socio-economic background. Harris found that subjects 
in the no-picture condition learned significantly faster on four of eight 
comparisons. On the other four comparisons, statistically significant differ- 
ences were not found. Harris attributed the failure to reach significance 
on more comparisons to the generally low level of learning for all his 
subjects, regardless of condition. In the Harris and Braun studies, all com- 
parisons which did not reach significance were in the predicted direction, 
i.e., faster learning for subjects in the no-picture condition. 


The Samuels (1967), Braun (1969) and Harris (1967) studies were 
similar in that the subjects had to learn a sight vocabulary by the look-say 
method and that training and test trials were alternated. During training 
trials, pictures and words were presented together for the picture conditions. 
During test trials, only the printed words were presented. The results 
were also similar in that the investigators in the three studies found that 
when pictures and words were presented together, pictures interfered with 
learning to read the words. 

Attentional processes and the principle of least effort (Underwood, 
1963) explain why pictures interfere with learning to read. The learning 
task set up for the subjects in these studies was essentially a 
paired-associate task, i.e., learning to associate a common English word with a printed 
stimulus. The stimulus that was presented to the subjects was complex; it 
consisted of a picture, which could elicit the correct verbal response by 
itself, and a printed verbal stimulus, which could not elicit the correct 
response when first presented. Since the printed stimulus could not elicit the 
correct response at first, the function of the picture, from a teacher’s 
point of view, was to prompt the correct response. The problem of getting 
the child to learn to read the word is one of shift in stimulus control, from 
the picture to the printed stimulus. At first only the picture can elicit the 
desired response; then there must be a shift to the printed stimulus as 
the cue which elicits the response. However, given two stimuli, one which 
can easily elicit the correct response (the picture) and one which can not 
(the printed stimulus), the principle of least effort operates. 


The principle of least effort is that when a complex stimulus is 
presented to a subject, he will select that aspect of the total stimulus 
which most easily elicits the correct response. For example, in a study 
recently reported by Samuels (1968), college-age subjects were given a 
list of ten high-stimulus similarity paired-associates to learn. During learn- 
ing trials, nine of the stimuli were printed in black letters and a tenth was 
printed in red letters. During test trials, all the stimuli were printed in 
black letters. Subjects were told that the same letters, regardless of color, 
would be presented on learning and test trials. The findings were that 
during learning trials, significantly more correct responses were given to 
the stimulus in red. But on test trials, when color cues were removed, 
significantly more correct responses were given to the stimuli that were 
always printed in black letters. It seems that for the stimulus in red, the 
subjects were responding to the color not the letter shape. Consequently, 
when the color cue was removed on the tests, the subjects were unable to 
respond. The principle of least effort can be seen operating in that during 
learning trials, when letter shape and color were both present as part of the 
stimulus complex, the subjects selected color—as the easier cue—to focus 
their attention on, since it most easily elicited the correct response. Because 
of the salience of the color cue, the subjects seemed unable to shift attention 
to the less salient but more relevant cue, which was letter shape. 

Similar processes explain why pictures interfere with learning to read. 
When pictures and printed words are presented together, it is the picture 
which readily elicits the desired response. Instead of looking at the printed 
word, the subject attends only to the picture. Apparently, the shift in atten- 
tion or stimulus control from picture to word fails to take place with some 
pupils. 

Other studies have used pictures to teach reading and have reported 
problems associated with their use. McNeil and Keislar (1962) used pic- 
tures as response alternatives in a reading program. They found that if a 
child liked a picture, he might select it even if it were the wrong response. 
Fowler (1962) also had difficulty using pictures as prompts. In nearly all 
cases, he found that a word minus a picture was superior to a word plus a 
picture. Hall (1961) and Ellson et al. (1962) reported that an important 
shortcoming of pictures as prompts is that only a limited number of words 
can be presented pictorially. Generally, concrete nouns and a limited num- 
ber of adjectives and verbs can be illustrated. An additional shortcoming of 
pictures as cues is that they can not reliably elicit the same response from all 
children; for instance, when shown a picture of a plane, one child may say 
“plane,” another “airplane,” and a third, “jet.” 

Several researchers have attempted to teach a sight vocabulary by 
using a fading technique (Popp and Porter, 1960; Taber and Glaser, 1961; 
Evans, 1963). In this technique a picture, which elicits the desired response, 
and a printed word, which does not, are shown together. The learner is 
supposed to visually attend both to the printed word and to the picture 
which prompts the desired response. Portions of the picture are gradually 
removed—or faded out—to effect a shift in stimulus control of the response 
from the picture to the printed word. If the technique is successful, the 
student should be able to give the correct response when the picture is 
completely removed and only the printed word remains. When this occurs, 
it can be said that there has been a shift in stimulus control from the pic- 
ture to the word. 

Although Taber and Glaser (1961) successfully used the fading technique to 
teach a sight vocabulary, Duell and Anderson (1967) failed to replicate 
the Taber and Glaser results. The procedure used by Duell and Anderson 
was one in which eight color words were paired with radiating lines 
that were of the color named by the word. In nine trials the lines were di- 
minished in size until only a dot remained. On the test, only the printed 
word was presented. According to Duell and Anderson, the problem with 
the fading technique is that the expected shift in attention and stimulus 
control from the color prompt to the printed word may fail to take place. 
Often just a small portion of the prompt elicits the correct response, and the 
learner continues to focus on the fragment of the prompt rather than the 
printed word. Duell and Anderson (1967, p. 79) wrote that “. . . the smallest 
prompt used—for instance, a dot of red—is just as discriminable as a line 
of red so that discrimination could still easily be made on the basis of the 
prompt alone. It may be that since the subject was not forced to shift from 
the prompt to the cue, he never did.” The Anderson and Faust (1967), 
Faust and Anderson (1967), and Duell (1968) studies demonstrated that 
prompted training sequences designed to force the student to notice the 
printed word stimulus are more successful than programs which do not 
force the student to notice the printed word. 


There is general agreement that pictures interfere with the acquisition 
of a sight vocabulary. King and Muehl (1965), however, concluded from 
their study that: (a) pictures have neither a positive nor negative effect 
when the words to be learned are dissimilar, and (b) pictures are an aid in 
learning to read when the words to be learned are similar. In their study, 
the dissimilar words were gate, drum, nest, and fork; the similar words were 
doll, bell, ball, and bowl. Their conclusion is probably valid only when 
the procedure used in their study is followed. The procedure used in teach- 
ing the words to all the groups, regardless of whether a picture was present 
or not, was one in which the child’s attention was drawn to the printed 
word. If the training method included a picture, after two seconds of expo- 
sure to picture and word, the picture was covered and the child was in- 
structed to look carefully at the printed word. Under usual classroom 
conditions, there is no one to cover the picture in the book and remind the 
child to look at the printed word. 


An interesting question which may be raised regarding the King and 
Muehl (1965) study is what function pictures played in facilitating the 
learning of the words ball, bell, bowl, and doll. Not only are these words 
visually similar, but when pronounced, they are acoustically similar. One 
can argue that the function of the picture was to clarify for the child which 
word was being pronounced by the experimenter. To find out whether it is 
the visual dimension, the auditory dimension, or both, which is affected by 
pictures, one needs only do a study using a 2 × 2 × 2 factorial design in which 
presence or absence of pictures, high or low visual similarity for the printed 
words, and high or low auditory similarity for the responses are varied. 
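
The eight cells of the factorial design proposed above can be listed mechanically. The following is only a minimal sketch: the factor labels and the helper name factorial_cells are illustrative assumptions and do not come from any of the studies reviewed here.

```python
from itertools import product

# The three factors and their two levels follow the design proposed in the text.
# The dictionary keys and level labels are illustrative names, not the author's.
FACTORS = {
    "pictures": ("present", "absent"),
    "visual_similarity": ("high", "low"),
    "auditory_similarity": ("high", "low"),
}


def factorial_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


cells = factorial_cells(FACTORS)

# Print one line per treatment condition.
for number, cell in enumerate(cells, start=1):
    print(number, cell)
```

Crossing three two-level factors yields 2 × 2 × 2 = 8 treatment conditions, so an effect of pictures can be examined separately under each combination of visual and auditory similarity.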


The effect of pictures on comprehension 


One of the reasons given for illustrating books is that pictures are supposed 
to increase comprehension. To test this assumption, Miller (1938, p. 
677) stated that the purpose of his study “. . . was to determine whether 
children who read a basal set of primary readers with the accompanying 
illustrations secure greater comprehension of the material read than do 
pupils who read the same material without the accompanying pictures.” 
His data showed that children understood what they read just as well without 
pictures as with pictures. Miller’s test questions required the students 
to select from a group of words the one spoken by the teacher, to identify 
from two phrases the one pronounced by the teacher, to cross out the word 
which did not belong among a group of three words, to complete sentences 
after reading a paragraph, and to put in proper sequence the events in a 
paragraph. Although one of the criticisms of this early study is that some 
of the questions are not measures of what is today called comprehension, 
some of the questions did test what is presently called comprehension. 

Halbert (1943) investigated the effect of pictures on recall of relevant 
ideas of a story. She found that pictures were an aid to recall. However, 
recall is somewhat different from the ordinary meaning of “comprehension.” 
Halbert recognized this and commented, “To the extent that memory for 
ideas is a measure of comprehension, to that extent pictures contribute to 
the comprehension of reading materials” (p. 7). 

Vernon’s (1953, 1954) two studies were done to determine the effect of 
pictorial stimuli on the recall and comprehension of written text. Unlike 
Halbert’s findings (1943) that pictures aided recall, the results of the Ver- 
non studies indicated that, in general, an illustrated version was neither 
remembered nor comprehended better than a non-illustrated version. 

Bourisseau, Davis and Yamamoto (1965) wrote that pictures are used 
in instructional materials under the assumption that they facilitate learning 
by appealing directly to one’s five senses. To test this assumption, subjects 
were presented either printed words or pictures representing the words. The 
subjects were asked to respond to the pictures or the words with the first 
word which came to mind. These responses were then categorized into a 
sense impression category (i.e., the response indicates sight, smell, sound, 
taste, touch) or a non-sense impression category. Simply comparing the 
responses given to the pictorial and the word conditions, one finds significantly 
more sensory responses given to printed words. The authors concluded 
that although many people believe that one picture is worth a 
thousand words, so far as sensory responses to pictures are concerned, this 
assumption is unwarranted. In this study responses to pictures and words 
were investigated, but the effects of color, complexity, details, and photographs 
versus drawings in eliciting sensory responses are still unexplored. 

Weintraub (1960) investigated the effect of pictures on comprehension. 
He had second graders read materials either with a picture or without a 
picture. After they read the materials, the students were given multiple- 
choice questions that measured details and main ideas. Weintraub found 
that comprehension scores were higher when pictures were not present. 

Koenke (1968) studied the effects of content-relevant pictures and 
instructions to use the pictures on the subject’s ability to identify the main 
idea of a paragraph. The four treatments were: (1) paragraph only—no 
picture, (2) paragraph and picture—no directions to use the picture, (3) 
paragraph and picture—minimum directions to use picture, (4) paragraph 
and picture—maximum directions to use the picture. Subjects for this study 
were third and sixth graders. Koenke found that the addition of a picture— 
either with or without directions to use it—did not enhance the student's 
ability to state the main idea of a paragraph. 

If a picture is to enhance comprehension, it must be able to convey 
information that is relevant to the questions asked on a test. Even if a 
picture does contain test-relevant information, the student may have diffi- 
culty determining which part of the picture is relevant and which part is 
not. Lindseth (1969) investigated the extent to which first, second, and 
third graders were able to answer comprehension questions on a developmental 
reading series by looking only at the pictures that accompanied the 
stories upon which the questions were based. She found that the children 
were not able to answer 
comprehension questions solely by looking at pictures. 


The effect of pictures on attitudes 


Two studies are available that dealt with the effect of pictures on atti- 
tudes. In Vernon’s (1953) study on pictures and comprehension, she noted 
that in several cases, pictures produced an emotional impact that might 
affect attitudes toward the problems described in the text. However, Vernon 
recognized that these attitudes would not necessarily lead to reasonable sug- 
gestions for specific courses of action. 


An intriguing study on change of racial attitudes through the use of 
pictures was done by Litcher and Johnson (1969). This study is intriguing 
because of the ease with which attitude change was achieved. Second grade 
children were assigned either to a multiethnic basal reader or a regular 
basal reader. The written content was the same for both groups; only the 
pictures used to illustrate the stories were different. In the multiethnic 
reader, some of the characters were nonwhite; in the regular reader the 
characters were white. 


The study employed a pretest-posttest, control-group design. Four 
teachers were randomly selected for the study. Each teacher had in her 
classroom a reading group using a multiethnic reader and a different group 
using a regular reader. The teachers did not refer to the multiethnic read- 
ers nor did they encourage any discussion of race. Both groups were taught 
reading using standard methods. 


Prior to the experiment, the subjects were asked questions which indi- 
cated racial attitudes. Four months later, after using either the multiethnic 
or the regular basal readers, the students were asked the same questions 
again. Changes in the two groups were compared. The results of the study 
clearly indicated that the group with the multiethnic reader had experienced 
far more favorable attitude changes towards Negroes. 

This study raises several questions. Can similar results be produced 
in higher grades than the one used here? Secondly, the setting for the study 
was an upper-midwestern city in which Negroes represent only .5% of the 
population. Could similar results be obtained in a city where the students 
have more direct contact with Negroes and the racial balance is more even? 
Finally, can this study be replicated? 


Summary and conclusions 


1. The bulk of the research findings on the effect of pictures on acqui- 
sition of a sight-vocabulary was that pictures interfere with learning to read. 

2. There was almost unanimous agreement that pictures, when used as 
adjuncts to the printed text, do not facilitate comprehension. 

3. In the few studies done on attitudes, the consensus was that pic- 
tures can influence attitudes. 

Should pictures be used as adjuncts to printed text? The answer de- 
pends on the objectives. If the objective is to promote acquisition of a sight 
vocabulary, the answer would seem to be “no.” If the objective is to facilitate 
comprehension, the answer is less definite. Although the research, in 
general, does not show that pictures aid comprehension, neither does it 
show that they hinder comprehension. Much research still needs to be done 
on the effect of pictures on attitudes. 

One argument for including illustrations with basal readers and with 
other books used to teach reading is that attractive pictures may help a child 
develop positive attitudes toward reading. Learning to read is a difficult task 
for many children, and it is possible that attractive pictures which accom- 
pany the text may make the task of learning to read a bit more pleasant. 
Unfortunately, not a single study has come to this author's attention which 
can answer the question of the effect of pictures on attitudes towards reading. 
A study to answer this question would be easy to design and should be 
done. If a study such as the one suggested here were to be done, and if it 
were found that presence of attractive pictures made reading more enjoyable, 
the educator might find himself on the horns of a dilemma; that is, whether 
to keep pictures out of books to facilitate learning to read or to include 
pictures to build favorable attitudes toward reading. A solution which might 
satisfy both objectives might be to keep pictures out of the child’s text and 
to give the teacher an appropriate series of large and colorful pictures for 
each story. The teacher might then show and discuss each picture as she 
builds a background for the story. When the children actually read, the 
pictures would be put away. Still another solution might be to include pic- 
tures in the student’s book, but to organize the book so that pictures and 
text were separated in such a manner that the pages with printed text 
would be free of pictures. 




REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 3 


References 


Anderson, R. C., & Faust, G. W. The effects of strong formal prompts in programmed 
instruction. American Educational Research Journal, 1967, 4, 345-352. 

Baker, R. W., & Madell, T. O. A continued investigation of susceptibility to distraction 
in academically underachieving and achieving male college students. Journal of 
Educational Psychology, 1965, 56, 254-258. 

Bloomfield, L., & Barnhart, C. L. Let’s read. (Parts 1-3) Bronxville, N.Y.: C. L. Barnhart, 
1963. 

Bourisseau, W., Davis, O. L., Jr., & Yamamoto, K. Sense-impression responses to differing 
pictorial and verbal stimuli. Audio-Visual Communication Review, 1965, 13, 249-254. 

Braun, C. Interest loading and modality effects on textural response acquisition. Reading 
Research Quarterly, 1969, 4, 428-444. 

Dechant, E. V. Improving the teaching of reading. Englewood Cliffs, N.J.: Prentice-Hall, 

Duell, O. K. An analysis of prompting procedures for teaching a sight vocabulary. Amer- 
ican Educational Research Journal, 1968, 5, 675-686. 

Duell, O. K., & Anderson, R. C. A failure to teach a sight vocabulary by vanishing literal 
prompts. Paper read at the annual Meetings of the American Educational Research 
Association, New York, 1967. 

Ellson, D. G., Engle, T. L., Barber, L., & Kempworth, L. Programmed teaching of ele- 
mentary reading. A progress report. Bloomington: University of Indiana, 1962. 

Evans, J. L. A behavioral approach to the teaching of phonetic reading. Paper presented 
at the International Reading Association Convention, Miami, May 1963. 

Faust, G. W., & Anderson, R. C. The effects of incidental material in a programmed 
Russian vocabulary lesson. Journal of Educational Psychology, 1967, 58, 3-10. 

Fowler, W. Teaching a two-year-old to read: An experiment in early childhood learning. 
Genetic Psychology Monographs, 1962, 66, 181-283. 

Halbert, M. An experimental study of children’s understanding of instructional materials. 
Bureau of School Service, University of Kentucky, 1943, 15, (4), 7-59. 

Hall, R. A., Jr. Sound and Spelling in English. Philadelphia: Chilton Co. - Book Division, 
1961. 

Harris, L. A. A study of the rate of acquisition and retention of interest-loaded words 
by low socio-economic kindergarten children. Unpublished doctoral dissertation, 
University of Minnesota, 1967. 

King, E., & Muehl, S. Different sensory cues as aids in beginning reading. The Reading 
Teacher, 1965, 19, 163-168. 

Koenke, K. R. The roles of pictures and readability in comprehension of the main idea 
of a paragraph. Paper presented at the annual Meetings of the American Educational 
Research Association, Chicago, February 1968. 

Lindseth, M. L. The use of pictures to answer comprehension questions in a selected 
developmental reading series. Unpublished master’s thesis, University of Minnesota, 
1969. 

Litcher, J. H., & Johnson, D. W. Changes in attitudes toward Negroes of White elementary 
school students after use of multiethnic readers. Journal of Educational 
Psychology, 1969, 60, 148-152. 

McNeil, J. D., & Keislar, E. R. Value of the oral response in beginning reading: An ex- 
perimental study using programmed instruction, Grant No. 1431, Washington, D.C.: 
U.S. Office of Education, Department of Health, Education and Welfare Report, 
1962. 


Miller, W. A. Reading with and without pictures. Elementary School Journal, 1938, 38, 676- 




SAMUELS EFFECTS OF PICTURES ON LEARNING TO READ, COMPREHENSION AND ATTITUDES 


Popp, H., & Porter, D. Programming verbal skills for primary grades. Audio-Visual 
Communication Review, 1960, 8, 165-175. 

Samuels, S. J. Attentional process in reading: The effect of pictures on the acquisition of 
reading responses. Journal of Educational Psychology, 1967, 58, 337-342. 

Samuels, S. J. Relationship between formal intralist similarity and the Von Restorff 
effect. Journal of Educational Psychology, 1968, 59, 432-437. 

Silverman, M., Davids, A., & Andres, J. M. Powers of attention and academic achieve- 
ment. Perceptual and Motor Skills, 1963, 17, 243-249. 

Taber, J. I., & Glaser, R. An exploratory evaluation of a discriminative transfer learning 
program using literal prompts. Cooperative Research Program Project No. 691 
(9417). Pittsburgh: University of Pittsburgh, March 1961. 

Underwood, B. J. Stimulus selection in verbal learning. In C. N. Cofer & B. S. Musgrave 
(Eds.), Verbal Behavior and Learning, Problems and Processes. New York: McGraw- 
Hill, 1963. Pp. 33-48. 

Vernon, M. D. The value of pictorial illustration. British Journal of Educational Psy- 
chology, 1953, 23, 180-187. 

Vernon, M. D. The instruction of children by pictorial illustration. British Journal of 
Educational Psychology, 1954, 24, 171-179. 

Weintraub, S. The effect of pictures on the comprehension of a second grade basal 
reader. Unpublished doctoral dissertation, University of Illinois, 1960. 


AUTHOR 


S. JAY SAMUELS Address: University of Minnesota, Minneapolis, Minnesota Title: 
Associate Professor of Educational Psychology Age: 40 Degree: Ed.D., University of 
California, Los Angeles Specialization: Verbal Learning; Reading. 


PROFESSIONAL ROLE DISCONTINUITIES 
IN EDUCATIONAL CAREERS 


Herbert J. Walberg¹ 
Harvard University 


Oscar Wilde wrote: “In this world there are only two 
tragedies. One is not getting what you want, and the other is getting it.” 
His irony may not only apply to childhood dreams, but to the incongruity of 
anticipation and reality in adult careers. With this possibility in mind, my 
purpose here is to review some research on role transition among educators, 
specifically college presidents and teachers in the lower schools. In doing 
so, it seems useful to analyze their personalities and roles in terms of Jacob 
Getzels’s socio-psychological theory of behavior and to construe their organizations, 
schools, and universities, as bureaucracies in Max Weber's sense. 


The Collegium and Bureaucracy 


Weber’s theory of the inexorable rise of bureaucracy (1947) permeates 
much of the analysis of the development of institutions. He argued that be- 
cause modern institutions have increased in size and complexity and because 
formal organization incorporates more and more power, bureaucracy has 
encroached on traditional forms of governance, the collegium. According to 
Weber, membership in the collegium is restricted to those who have under- 
gone special training for their positions. Although the collegial group is 
egalitarian and democratic, areas or subjects are assigned to individual tech- 
nical experts. Decisions are made with thoroughness and deliberation rather 
than precision and efficiency. Individuals and cliques are heterogeneous and 
often in opposition, and the group proceeds by struggle, coalition, negotia- 
tion, and compromise. Members tend to hold veto power over one another, 
and because the collegium is ideally composed of equals, leaders are chosen 
by lot, oracle, election, or rotation for limited tenure. 


Bureaucracy consists of a series of offices organized hierarchically; that 
is, each lower office is controlled and supervised by a higher one. On the 
basis of “rational” values, a consistent system of legalistic norms and abstract 
rules is formulated. Bureaucratic control is the impersonal application 
of these rules to particular cases. Bureaucracies tend to increase their 
power not only because they are rational and efficient, but because they 
accumulate knowledge through experience and record it in stores of documentary 
material or “official secrets.” 


¹The author is now at the University of Wisconsin Research and Development Center 
for Cognitive Learning and gratefully acknowledges the influence of J. W. Getzels’s writings 
and personal communications and comments on a draft manuscript by Nancy St. 
John, T. Douglas Hall, David Riesman, and Barak Rosenshine. 


Collegial and bureaucratic forms of organization obviously conflict in 
theory and practice. Weber (1947 translation, p. 403) wrote at the begin- 
ning of this century: “. . . the history of modern administration in the 
Western World begins with the development of collegial bodies composed of 
technical experts . . . , [but] . . . the general tendency of bureaucracy . . . 
has become definitely victorious.” If this is so, it is necessary to examine 
the consequences of bureaucracy for institutions as well as for individual 
careers within institutions. 


Consequences of Bureaucracy 


Bureaucracy has several benefits for institutions, bureaucrats, and 
clients it serves. As Weber pointed out, the institution gains power through 
bureaucracy: it employs efficient means to accomplish its ends and thereby 
preserves itself. Bureaucracy also protects the bureaucrat and his client 
from idiosyncratic, personally-toned relationships by pre-determined cate- 
gories of interaction. The bureaucrat can move successfully through his 
career by following the “official rules” whether he personally endorses them 
or not. Similarly, the client is guaranteed egalitarian, if not desirable, 
treatment. He may not like what he gets, but at least he is being treated 
like everyone else. Theoretically and, to a large extent, practically, some 
undesirable aspects of institutional life such as favoritism and nepotism are 
minimized. 

There may be some serious and disturbing consequences of bureaucracy 
—not as much for the institution as for the society. As an institution gains 
additional security and power through the rationalization and efficiency 
of bureaucracy, it may emphasize the means which give it strength and 
stability rather than its original or ostensible ends. Moreover, institutions 
may lag behind changing society and culture in much the same way that 
culture lags behind technology as described by Ogburn (1952). If institu- 
tions can be thought of as persons, as Fromm (1955) analyzes society, one 
can characterize them with psychological concepts. Thus “sick institutions” 
may not cope with their problems because of the force of ingrained habits 
which are no longer adaptive. They may become “compulsively neurotic” 
in that institutional means do not accomplish professed ends. They may be 
blind to cultural or social problems in much the same way that neurosis 
blinds the individual. When challenged by external reality, instead of 
responding creatively, institutions may become more bureaucratic much 
like the rigidity, authoritarianism, and dogmatism of the “defensive neurotic.” 
When its shortcomings become apparent, when actual means are 
seen not to correspond to purported ends, the institution may mold and 
develop an “image” through advertising and public relations as Boorstin 
(1962) has shown, in the way that the neurotic compensates or puts up a 
false front. 


This is not to say, however, that institutions wither away after these 
processes and consequences of bureaucracy. On the contrary, though 
bureaucracy is portrayed here as a kind of institutional sickness in an exag- 
gerated sense to bring out its salient features, it is quite obvious that major 
bureaucratic institutions are gaining much ground. Mills (1956), for exam- 
ple, described the enormous “power elite” of the military, government, and 
business. More specifically, Galbraith (1967) showed how sophisticated 
bureaucracy has evolved in the business corporation to almost guarantee 
profits in the “new industrial state.” Whyte (1957) vividly analyzed the 
career stages of the “organization man” in business, and Schein (1968) has 
recounted the trauma and later accommodation to the corporate bureau- 
cracy by idealistic management-school graduates. These latter points bring 
us to our main concern here: the examination of the adjustment of individ- 
uals to careers in educational institutions viewed in the light of Weber's 
theory of bureaucracy. 


Getzels’s Model 


Getzels (1963) outlined a socio-psychological model “for the analysis 
of inter-relationships among cultures, institutions, and individuals” (see 
Figure 1). The model assumes that the individual and the institution are 
embedded in culture or sub-cultures each with its own ethos and values. 
Within an institution are expectations for various roles; and for any indi- 
vidual, need-dispositions with origins in personality may be specified. These 
factors interact in the determination of social behavior, and in a given 
analysis they may reinforce or conflict with one another. When there 
is reinforcement or correspondence among the factors, cultural good, institutional 
efficiency, and individual satisfaction are likely to result. The reverse 
is likely to occur when there is conflict between culture and institution, 
between institution and individual, and between culture and individual or 
conflict between and within roles and between differing perceptions of 
roles. This is an abbreviated summary of Getzels’s model, and the reader 
is referred to the paper cited which is rich with illustrations and summaries 
of empirical studies. 


Fig. 1. Getzels’s Model: Individual and Institution Embedded in Culture 
[Diagram labels: Culture, Ethos, Value; Social System; Institution, Role, Expectation; Individual, Personality, Need-Disposition; Social Behavior] 


The model may be useful not only for static treatment of social behavior, 
but for dynamic analysis of changing cultures, institutions, and individuals. 
As we have already seen, cultures may change faster than institutions, thus 
producing various strains and conflicts. Conversely, avant-garde institu- 
tions may advance more quickly than culture. But our concern here is for 
the individual and the institution. What institutional roles must the indi- 
vidual educator play as he moves from one career stage to another? From 
the perspective of role conflict, are there “individual lag” factors which 
prevent satisfaction and effectiveness in a new role? From the perspective 
of personality, are there factors which make for satisfaction and effectiveness 
in one role which cause the reverse in a succeeding role? 


Career Transitions 


Table 1 shows three similar career-role stages—preparation, teaching, 
and administration—for elementary and secondary school teachers and for 
college and university professors. Social, technical, and directive aspects of 
role were adapted from Moment’s (1967) analysis of transition in the 
career development of industrial managers. Neither the role aspects nor the 
stages in the table are meant to be inclusive; indeed, anyone who has taught 
at either level will recognize the over-simplification of these roles. Never- 
theless, the table is sufficient for our purposes of preliminary analysis here. 
It can be noted, for example, that there are a number of cases of drastic role 
transition; it may be more accurate to use Getzels’s (personal communica- 
tion) term “role reversal.” For example, in his first teaching job, the novice 
professor no longer takes courses; he teaches them. For the lower school 
teacher, the reversal may take place during practice teaching. This transi- 
tion will be one focus of subsequent analysis. Another case of role reversal 
is the beginning Ph.D. who has only done directed research under his 
advisor and who must begin directing the research of his own students. 
Also, consider the professor who moves into administration. Will his schol- 
arly ways interfere with the execution of his new role responsibilities? Will 
being a notch above former colleagues create social tensions and conflicting 
perceptions of his new role? One would think so judging from the cynical 
characterizations of a dean one hears—“a mouse training to be a rat” or 
“an academician too smart to be president but too stupid to be on the 
faculty.” This will be the second focus of analysis. 
TABLE 1 
Some Role Aspects and Stages in Educational Careers 


Elementary and Secondary School Teacher: 

              College Student          [Teacher]                      Principal 
Social        Fellow students          Fellow teachers                ? 
Technical     Taking college courses   Teaching elementary subjects   Administration 
                                       Maintaining discipline 
                                       Establishing rapport 
Directive                                                             Teachers 
  (a)         Professors               Principal                      Superintendent 
  (b)         Self                     Pupils                         Teachers 


College and University Professors 

              Graduate Students        [Professor]                    President 
Social        Fellow students          Fellow professors              ? 
Technical     Taking courses           Teaching courses               Administration 
              Directed research        Directing research 
                                       Independent research 
Directive 
  (a)         Professors               Department chairman            Trustees 
  (b)         Self                     Students                       Deans, chairmen, 
                                                                      professors 


Note—(a) and (b) under “Directive” refer respectively to who is directing and who is 
being directed. 


Let us turn to some empirical studies of the impact of these role transi- 
tions or reversals on the individual. Assume that the schools (Waller, 1932) 
and colleges (Paige, 1951) are bureaucratic to some extent, and, therefore, 
exhibit some of the “symptoms” described earlier: hierarchical structure, 
resistance to change, precedence of means over ends, and the creation of a 
false image. Further assume that individuals in these institutions must to 
some extent play bureaucratic roles: impersonal, categorical, legalistic, and 
unresponsive to the personality needs of its clients or to changes in the 
cultural ethos. Given these assumptions, what are the characteristics of 
students who aspire to teach in the lower schools and what happens to them 
when they begin? 


Teacher Roles 


A variety of investigations illustrate that prospective teachers wish to 
associate and identify with children or to revivify their own childhood. 
Dilley (1957), using a pair-comparison, forced-choice scale, found that the 
distinguishing values of students in teacher education are: “desire for con- 
tacts with children,” “desire for contacts with adolescents,” and “desire for 
opportunities to help other people.” Goodenough, Fuller, and Olson (1946) 
used a specially constructed word-association test and found that student 
teachers responded with associations to childhood much more frequently 
than did liberal arts students. For example, prospective teachers made more 
references to childish activities (e.g., block-building, bubble-blowing) and 
to children (rattle-baby; poor-little baby) and chose more favorable adjec- 
tives to describe behavior (conduct-good; mind-bright) than did liberal arts 
students. Seagoe (1942), in a study of childhood experiences of students, 
reported that those who enter teaching enjoyed playing “school” as children 
much more than did students who are preparing for other professions. 
Stern (1963), in a review of factor-analysis studies, noted that empathy is 
consistently found to be a factor in students’ ratings of teachers and in 
teachers’ ratings of themselves. An example is Ryans’ (1960) Pattern X, 
described by the adjectives “warm,” “understanding,” and “friendly.” It is 
not unreasonable to conclude that prospective teachers have a need to 
identify and associate with children. 


Personality-Role Conflict 


These studies suggest that personality needs of education students (to 
establish rapport with children) conflict with the bureaucratic institutional 
role of the teacher. This conflict may result in less satisfaction and effectiveness 
in the beginning teacher. It apparently deflates the professional 
self-image during the first teaching experiences. Walberg (1968) 
found that in Chicago, self-ratings in the professional role of teacher de- 
clined sharply after practice teaching. The changes over time implied self- 
depreciation on intellectual mastery, lower expectations of pupil behavior 
and aspiration of self in the role of teacher, and less rapport with children in 
the class. Using similar research methods, these findings were later repli- 
cated on the whole in Boston and its suburbs (Walberg, Metzner, Todd, 
and Henry, 1968). Horowitz (1968) used Q sorts and found moves from a 
personal to an institutional conception of the teacher among practice teach- 
ers in Canada and the United States. Collectively, these studies confirm 
the hypothesis that personality-role conflict produces loss of self-esteem 
and movement toward the institutional role during practice teaching. Sim- 
ilar processes may occur with respect to intellectual needs. Levin, Hilton, 
and Leiderman (1957) found that persistence in teaching is higher in 
teachers who are less interested in books and their subject matter. Stern 
(1963) reported that scores on an intellectual-needs test were higher for 
teacher trainees than for teachers and that the most academically oriented 
trainees do not go on to become teachers. Personality-role conflicts may be 
more severe for the very people who are potentially the best teachers. They 
may help to account for the high dropout rate of beginning teachers during 
the first few years of teaching; the average career expectancy is only about 
two years (Charters, 1964). 


Conflicting Perceptions of Roles 


Another kind of conflict that may occur in the neophyte in an institu- 
tion is that his perception of his role conflicts with the role expected of him 
by experienced professionals or the bureaucratic hierarchy. Jackson and 
Moscovici’s (1963) research showed that even before practice teaching, the 
teacher-to-be has a distinct perception of his future role. Using word completions 
and other projective devices, these researchers found that his most 
salient need and concern is knowing how to maintain a pleasant classroom 
environment while performing as an authority figure. Biddle, Twyman, and 
Rankin (1962) found that education students differ from experienced 
teachers in their normative perception of the teacher's role as revealed by 
ratings of listed teaching behaviors. The education students approved of 
more teacher activities, more self-indulgence on the part of the teacher, and 
more tolerance of pupil behavior than did experienced teachers. Washburne 
(1957) interviewed graduate students in education who had experience in 
teaching and asked them to whom they were responsible as teachers. The 
students more often answered “myself” than “pupils,” “parents,” and “the 
community.” Least often mentioned was responsibility to “administrators.” 
Washburne interpreted these and his other findings as indicative of a con- 
flict of authority between teacher and the administrator arising from their 
differing perceptions of the teacher’s role. 


Culture and Institutional Role 


The interaction of culture and institution is another source of hypothe- 
ses in the Getzels model. Culture shapes institutions, at least their origins 
and development before they are fully bureaucratic. Thus, entry into 
schools in different cultures or sub-cultures should produce different effects 
on beginning teachers. Three studies already mentioned confirm this hypothesis. 
In Chicago’s inner city (Walberg, 1968) and in middle and upper-middle 
class suburban schools around Boston (Walberg et al., 1968), practice 
teachers declined on professional self-image ratings and democratic teach- 
ing attitude scales. However, the beginners in the affluent suburban schools 
rose on personally fulfilling aspects of self-concept and saw themselves as 
having better control of the class after practice teaching. Also included in 
the Boston study were comparable groups of education students tutoring 
slum children, many of whom were truants and “school behavior problems.” 
Before and after ratings on the same scales showed that, in contrast to the 
suburban practice teachers, the tutors saw themselves as less good and stable 
in the professional role of the teacher, but more pupil-centered and less 
controlling and authoritarian. Horowitz’s (1968) study of beginning teach- 
ers in the United States and Canada showed sharp differences between 
national samples and between different sub-samples by region within na- 
tion. Beginners in Canada, particularly in one region, became more oriented 
to the institutional role as opposed to a personal role, when contrasted with 
beginners in the United States. Thus, these exploratory studies appear to 
support the hypothesis that schools in different cultures have different effects 
on factors such as self-concept and teaching attitudes of the beginning 
teacher. 


Satisfaction and Effectiveness 


Consider some cross-sectional studies of teacher personality which bear 
upon role socialization and its effects on satisfaction and rated effectiveness. 
Guba, Jackson, and Bidwell (1959) tested veteran teachers on a personality- 
needs schedule. They found that teachers scored higher on deference, order, 
and endurance and lower on exhibitionism, dominance, and heterosexuality 
than college norms. They acknowledged that factors other than role social- 
ization (for example, age) may have accounted for this “meek” need struc- 
ture, that might be called here “bureaucratic personality.” (The differences 
were replicated later by Merrill, 1960, with better comparison groups.) 
They then turned to a more interesting comparison, veteran teachers and 
education students. They found that students in teachers’ colleges resembled 
veteran teachers on the personality scales much more than did education 
students at universities. However, comparisons of veteran teachers who had 
attended different kinds of training institutions revealed no differences 
between the groups on these scales. Either role socialization or attrition of 
deviant personalities apparently levels the needs brought about by the press 
of the training institutions. Finally, these investigators calculated scores for 
each teacher revealing his similarity to the mean score for all teachers. They 
found that, for a given school, the more closely the teachers resembled the 
typical teacher personality pattern, the less likely they were to feel satisfied, 
effective, and confident, but the more likely the principal was to regard 
them as effective. 


Creativity and Innovation 


The converse of these findings is that the more effective teachers, “creative,” 
if you will, would be regarded as less effective by school administra- 
tors. To my knowledge, no one has yet satisfactorily defined or devised 
a measure of teaching effectiveness, but there are two studies of “creativity” 
in teachers. Jex (1963) reported research on a group of science teachers 
attending a summer institute. After finding that teachers receiving higher 
professional ratings by administrators scored lower on Flanders general 
creativity scales, he and his colleagues developed a “teaching ingenuity 
test” consisting of free response items to questions such as: “List as many 
ways as possible a hole in the ground might be used in a school science 
activity.” He scored the answers for quantity and quality and correlated 
them with ratings of the teachers by their principals. Every single one of 
the ten principal ratings was negatively correlated with ingenuity. His 
measure correlated —.43 with rated quality of personality, and —.38 with 
overall teaching effectiveness. 


Walberg and Welch (1967) studied the comparative personality needs, 
values, and achievement of physics teachers who had volunteered to teach 
an innovative physics course for high school students. The teachers were 
courageous enough themselves to come from all parts of the country to 
Massachusetts to try using radically different and untested course materials. 
The teachers scored much higher on a physics achievement test than other 
groups of physics teachers. Their personality test scores showed that, com- 
pared to other science teachers, they had much higher theoretical and 
aesthetic values—both associated with creativity in previous research—and 
lower needs for abasement and higher needs for autonomy and hetero- 
sexuality—hardly the “meek need structure” of the typical teacher found 
in other studies. 


It may be concluded from this analysis and review of longitudinal 
and cross-sectional research on teacher role accommodation that the notions 
of the schools as bureaucracies as conceived by Weber and the Getzels 
model seem useful conceptual tools. The remainder of this paper is a 
further test of these ideas on some cross-sectional research on 180 college 
presidents that Hemphill and Walberg (1967) carried out in New York 
State. 


Roles of College Presidents 


It has been argued elsewhere and documented through logged time 
studies (Walberg, 1969) that the modern college president’s role is much 
more administrative, executive, and bureaucratic than scholarly and 
collegial. Another important aspect of his role, as revealed by how he 
spends his time, is external representation: “image-making,” fund raising, 
speech making, and entertaining. Yet many presidents arise from the faculty, 
presumably a scholarly, collegial body. While experience in the collegium 
should give a president insight into the intricacies of professional relations, 
too much experience may lead to an over-identification with the collegial 
role which may interfere with his assumption of presidential responsibilities. 
Also, since subordinate administrative positions in higher education resemble 
the presidency to some extent, they should lead to greater satisfaction and 
effectiveness in new roles. These generalizations serve to organize and 
interpret many of the important findings. Although admittedly after the fact, they 
may be employed as hypotheses in future work. 


Satisfaction 


There are a number of distinguishing characteristics of presidents who 
indicated on a questionnaire that they very often found their work highly 
satisfying. In contrast to presidents who were less satisfied, these men had 
longer experience in administration and in higher education. They more 
often had taught for a few years, but few had more than nine years or had 
no experience at all. They spent more time in bureaucratic activities, for 
example, administrative planning, reviewing reports, and authorizing ex- 
penditures and less time on collegial activities, teaching, meeting with stu- 
dents, working with the faculty on curriculum, and scholarship. They 
more often viewed their job as much like that of the chief executive of a 
business and even used modern business office methods more often, for 
example, dictating to a machine rather than to a secretary or writing in 
longhand. These factors suggest a renunciation of the collegial role and a 
full assumption of the burden of bureaucratic responsibilities. 


Effectiveness 


People who knew the work of the presidents in New York provided 
effectiveness ratings. The sample was split approximately in half on pooled 
ratings, and the two groups were compared. Teaching and administrative 
experience were found to predict effectiveness. An analysis was also made 
of each president’s two prior positions. Those who had been deans, college 
administrators, and department heads were much more likely to have high 
effectiveness ratings. Those who had held positions outside education were 
much less likely to be rated effective. The best predictor of effectiveness 
among the twenty categories of prior positions was that of the school super- 
intendent prior to the job held before the presidency. The effective 
presidents resembled the satisfied group (satisfaction and effectiveness were 
uncorrelated) in their assumption of bureaucratic responsibilities and their 
rejection of the collegial role. They spent more time on administration, had 
more subordinates reporting to them directly, and utilized efficient office 
procedures. They also spent more time on “image-making,” external rela- 
tions with alumni, parents of students, local government officials, commun- 
ity leaders, and professional groups. They gave more speeches per year 
and spent more time on fund raising. In summary, both the more satisfied 
and the more effective presidents tended to come up through the bureau- 
cratic hierarchy: they have acquired understanding of collegial roles but 
firmly rejected them to assume the administrative role. The distinguishing 
general characteristic of the effective president is his willingness to assume 
not only bureaucratic responsibilities but also those of external representa- 
tion. There were additional findings on differing roles in different types of 
colleges and universities (Hemphill and Walberg, 1967), but the present 
analysis served its purpose in showing some similarities in role transition in 
schools and universities. 


Conclusion 


In conclusion, I find in Weber’s theory of bureaucracy a useful 
construct for examining research on educational role transition. More- 
over, Getzels’s socio-psychological model of culture, institution, and the 
individual serves as an elegant analytical tool as Getzels and his colleagues 
have shown in the static analysis of social behavior and in the dynamic 
case insofar as it has been employed here. If the model is useful in con- 
ceptualizing research on such diverse groups as school teachers and college 
presidents, it promises to be useful for the analysis of other educational roles. 
Of course, much more analysis needs to be done. Also, our interest here 
has been theoretical rather than methodological. Many of the studies 
reviewed could be improved by better operationalizing the measures; 
longitudinal designs would be much more convincing than the cross- 
sectional studies cited here in a few cases, and true experiments with 
random assignments of persons to groups would be a radical departure from 
research in this area. Weber’s theory also helps to explain why it has not 
been carried out. Schools and colleges as bureaucracies do resist change and 
innovation and abhor research and evaluation—at least on themselves. 
ART EDUCATION FOR THE YOUNG CHILD 


Marvin Grossman 


University of South Florida 


An examination of the history of instruction in art 
education and of a number of individual kindergarten and elementary art 
education texts reveals that there are many basic teaching methods which 
can be roughly grouped into two broad teaching philosophies or orientations. 
The first includes methods based on the philosophy that artistic abilities 
are inborn and if the natural growth processes are allowed to mature, the 
young artist’s ability will unfold. This teaching philosophy implies that 
society imposes standards that may inhibit the child’s natural artistic 
development. Generally, this philosophy dictates that the responsibility of 
the art teacher is to provide an environment that does not interfere with 
the child’s expressive abilities. 

Art education texts by Kellogg and O'Dell (1967), Bland (1960), 
and Hoover (1961) contain descriptions of art instructional methods for 
young children based on this maturational philosophy. Kellogg and O'Dell 
(1967, p. 17) wrote: 


Children who are left alone to draw what they like develop a 
store of knowledge which enables them to reach their final stage 
of self-taught art. From that point they may develop into gifted 
artists, unspoiled. Most children, however, lose interest in drawing 
after the first few years of school because they are not given this 
chance to develop freely. 


A second approach is based on a philosophy that art is basically 
a social and human enterprise and is given direction by man’s interaction 
with his environment. With this approach, emphasis is placed on the 
need for environmental experiences and direct teaching if a child is to 
develop artistically. According to this orientation, artistic development 
depends on the child’s experiences. 

Although theories that emphasize the maturational approach best 
exemplify the prevalent practices in art education for kindergarten children, 
some art educators now tend to favor and recommend more systematic 
curriculum development. 
The Importance of Early Educational Experiences 


Psychologists and educators (Bloom, 1964; Bruner, 1960; Hunt, 1964) 
have recently stressed the importance of early educational stimulation. 
Gibson (1966, p. 26), an authority in the area of perceptual development, 
wrote: 


In the course of studying perceptual development and writing a 
book about it, I have become more and more convinced that an im- 
portant part of perceptual learning, grasping the distinctive fea- 
tures of objects and the invariants of events, goes on very early in 
life. 


She added that when a child reaches school age he has developed a style 
of perceptual learning. 


In Toward a Theory of Instruction, Bruner (1960, p. 29) wrote: 
“. . . unless certain basic skills are mastered, later, more elaborated ones 
become increasingly out of reach.” An art educator, Salome (1966, p. 28), 
expressed a similar concern: “a wealth of training and experience contri- 
bute to an artist’s ability to perceive subtle visual relationships and 
transmit them into aesthetic forms.” He suggested that without special 
instruction, young children fail to develop fully their visual perceptual 
skills and that such deficiencies influence the way children respond to 
and organize stimuli. 


Research by such psychologists as Gibson and Gibson (1955), Harris 
(1963), and Witkin et al. (1962) supports the theory that learning is an 
important variable in the development of perceptual functioning. More 
pertinent to art education are studies by Nelson and Flannery (1967), 
Lewis and Livson (1967), and Salome (1964). These studies indicate that 
some types of learning experiences are more effective than others in de- 
veloping children’s abilities to handle visual information in their draw- 
ings. Nelson and Flannery exposed groups of six- and seven-year-olds to 
six different types of drawing instruction. The effectiveness of the different 
types of instruction was assessed by having the children draw a 
lozenge in eight successive trials. 


The children were assigned to six equal-sized groups on the basis of 
group intelligence scores and a sample reproduction of the lozenge. Each 
group received a different treatment: in group one, border characteristics 
were emphasized; the children were asked to follow the border of a line 
drawing of the lozenge with their finger. In group two the children cut 
the lozenge out of a stimulus sheet; attention to shape was emphasized. 
In group three, the children were helped to make rudimentary analysis of 
the lozenge in terms of proportional characteristics. In group four, each 
child was asked to criticize his product after each drawing trial using 
the example as a referent. Group five was simply asked to draw the lozenge 
as it appeared on the model sheet: this was a control for practice. The 
children in group six made copies of their first drawings, the only one 
made while looking at the model. While most of the instruction groups im- 
proved, the greatest and most reliable improvement occurred in group two. 
The instructions to group two emphasized the shape of the lozenge, by 
having the children cut out the lozenge from a stimulus. Although the 
authors did not mention it, this strategy would emphasize the border of the 
lozenge. When the children cut out the shape, they would have to attend 
to the edges. Generally, the children reacted differently to the various types 
of instruction; this difference suggests that the most effective instructional 
procedure might include a variety of analytic procedures. 

A study of the development of the child’s ability to represent spatial 
concepts was conducted by Lewis and Livson (1967). They found that 
children from grades one through six discovered progressively adequate 
means of depicting three-dimensional objects. When the youngest children 
were asked to draw a three-dimensional geometric figure such as a cube, 
they commonly drew a square. Children a bit older produced a great vari- 
ety of drawings that showed several sides of the figure in one plane. The 
next oldest group drew several sides of the figure in space but incorrectly 
related. Finally, the oldest children began to represent accurately the 
spatial relationships in one- or two-point perspective. The authors reported 
that the children’s abilities to make judgments of the adequacy of a particular 
graphic equivalent seem to partially account for their developing 
ability to depict spatial concepts. They indicated that tasks might be 
designed to speed up a child’s development by presenting him with drawing 
problems that require him to search for and judge the adequacy of two-di- 
mensional equivalents of three-dimensional forms. Lewis and Livson 
concluded that the development of means of depicting three-dimensional 
objects within the limits of a two-dimensional medium can be viewed as an 
individual’s response to a particular task, and that instruction and environ- 
mental influences can account for change. 

Salome (1965) investigated the effects of perceptual training on 
fourth- and fifth-grade children’s drawings. In an experimental and control- 
group design, the experimentals were given training that directed their 
attention to visual cues located in the contours of models of a lamp, truck, 
and armadillo; the control groups received conventional instruction in 
drawing the same stimulus objects presented in the experimental classes. 
The drawings were analyzed on a fifteen-point rating scale composed of 
three variables: (1) The degree to which the drawings represented the 
stimulus objects based on the amount of information included in the draw- 
ings; (2) closure-clarity judgments were based upon the extent to which 
the form and its component parts, relevant to the object, were enclosed 
by lines; and (3) proportion-judgments were based on the correspondence of 
the proportions of the stimulus object to the drawing. 

The results of the training in Salome’s study of fourth graders were 
inconclusive. Examining the individual drawings of each object, only the 
experimental group ratings of the truck drawings showed a significant 
improvement. When the drawing scores of all the objects were combined, a 
significant difference between control and experimental groups appeared in 
favor of the perceptual training group, suggesting a cumulative treatment 
effect. The fifth-grade results were more conclusive: significant differ- 
ences in favor of the experimental treatment were reported for individual 
and combined drawings. Salome concluded that perceptual training relevant 
to the utilization of visual cues located along contour lines does increase 
the amount of visual information fifth-grade children include in their 
drawings of visual objects. These studies and investigations by Dubin 
(1946), Douglas and Schwartz (1967), and others seem to strongly indicate 
that young children’s artistic and perceptual abilities can be influenced 
by instruction. 


An Art Instructional Strategy for Young Children 


A number of authors in art education and psychology suggested di- 
rections for a structured developmental art program for young children. 
In trying to hasten the development of preschool children’s drawing abilities, 
Dubin (1946) used a verbally oriented training method. She described a 
training program in which the children were encouraged to elaborate on 
their initial ideas by asking questions about their pictures. The conclusion 
of her study was that children’s drawing abilities could be developed 
without any negative effects on the spontaneity and creative aspects of 
their artistic behavior. 


Another investigation exploring the use of language to develop more 
complete visual concepts was reported by Douglas and Schwartz (1967). 
They examined the kinds of ideas about art that four-year-olds are able to 
comprehend and use in their observations of ceramic works and in their own 
work with clay. Basic art ideas were presented in a manner appropriate 
for young children. Four of the basic ideas were: 1) art is a means of 
non-verbal communication; 2) the art product is the result of the artist’s 
idea; 3) the artist uses what he sees, thinks, and feels to create art; and 
4) there is a great variety of materials available to the contemporary 
artist. The children studied by Douglas and Schwartz were shown profes- 
sional ceramic pieces, and the teachers used these pieces to point out 
the art ideas illustrated in these works of art. The teacher encouraged 
and tried to elicit verbal observations from the children about the ceramic 
pieces. At the conclusion of the study the children were able to comprehend 
and interpret these art ideas in their own clay products. They were also 
able to describe and discuss ceramic works in an art context. 


There seems to be little disagreement that more elaborate verbal and 
visual concepts help children develop more complex art expressions. McFee 
(1961, p. 201) pointed out that when children are motivated to look for 
relationships and patterns, variations in form, line, color, and texture, their 
perceptual skills should develop. 


Wachowiak and Ramsay (1965, pp. 25-27) in their text, Emphasis: Art, 
suggested a method of helping children develop more elaborate perceptions. 
They posited that the teacher should build art experiences around things 
that can be touched, studied, explored, understood, and expressed. They 
added that the child should be encouraged to draw directly from nature and, 
as a consequence, this sensitive observation can be the basis for a creative 
interpretation of the world around him. 


Research by Torrance recognized the importance of manipulation of 
objects in creative and inventive activities. He reported (1963, pp. 110-117): 


. . . it seems clear that in the tasks permitting manipulation of 
objects, the degree of manipulation significantly affects the number 
and flexibility of responses . . . . The results suggest that when we 
attempt to evoke inventive responses, subjects should be encouraged 
to manipulate the objects involved. They also suggest the need for 
devising means whereby children can imaginatively manipulate 
ideas and relationships where manual manipulation is not possible. 


Harris (1963) cited two studies that affirm the influence of kines- 
thetic exploration on cognition. Mott (1945) with children, and Geck 
(1947) with college students showed that by adding kinesthetic experience 
to visual and auditory impressions the quality of the subjects’ drawings 
was improved. Mott had the children exercise parts of their bodies as 
a group “game” before drawing the human figure. As the children went 
through the motor activities, they verbalized their movements, e.g., “this 
is my head, I nod it,” etc. Drawings made after these activities showed 
that the exercised part was not only more likely to be included but that 
it was drawn with more care for details. Geck’s study emphasized specific 
tactual and kinesthetic experiences by having the students manually 
explore a modeled human head before sketching it. 


The studies reviewed in this paper suggest that an art-teaching strategy 
for young children should develop children’s cognitive and sensory ex- 
ploration abilities. It seems that an effective way of developing young 
children’s artistic expressive abilities is to provide them with real and 
immediate objects or experiences, and teach them to explore these objects or 
experiences through all their senses. 

I conducted two very similar 12-week studies with kindergarten chil- 
dren to explore the effectiveness of an art-teaching strategy based on 
cognitive and sensory exploration. It was hypothesized that this teaching 
method would influence the rate and direction of kindergarteners’ aesthetic, 
creative, and visual developmental growth as indicated in their visual 
expressions. 


The general pattern of the results derived from the data indicated 
that the experimental program was consistently more effective in altering 
the artistic responses of the five-year-olds. Although the results in the 
first study were not always statistically significant, the experimental 
groups consistently had higher mean scores than the control treatments 
for all measures. The statistically significant results of the second 
study helped confirm the positive effects of the experimental program on 
the children’s aesthetic, creative, and visual developmental growth. In 
addition to the significant statistical differences, qualitative differences in 
the products of the experimental group children were also observed. The 
children in the experimental groups included a greater amount of costume 
and body-part details in their drawings of a clown. Their drawings were 
also notably larger, and they were able to use the compositional space 
more effectively. When their clown drawings were rated by artists into 
eight categories, the experimental group’s drawings indicated a more 
[illegible] use of the expressive elements such as composition, line, and use 
of color. 


In conclusion, there are strong indications that a developmental art 
program, stressing cognitive and sensory exploration can increase kinder- 
garten children’s abilities to include more visual information in their 
drawings. The directions pointed out in this paper indicate the importance 
of continued research in art education to determine the full implications and 
potential of developmental instruction in art for children. 

The purpose of this paper is to review some 
of the recent literature and research involved in the development of 
imitative learning and related concepts. Throughout this paper, for the 
purposes of clarity and to provide a central point of semantic reference, all 
such phenomena will be collectively referred to as imitation. 

Research on imitation is characterized by a lack of continuity and a 
lack of synthesis into a unified, comprehensive, theoretical structure. The 
concept, in some form, has been investigated by child psychologists, social 
psychologists, counseling psychologists, physiological and perceptual psy- 
chologists, psychotherapists, personality theorists, learning theorists, teachers, 
and industrial psychologists. Many of these researchers found similar 
results in their work, but they used different constructs and terminologies 
to interpret their findings. 


Social Learning and Imitation 


Perhaps the most significant development of concepts involved with 
the acquisition of behavior through imitation is attributable to research 
efforts in social learning areas. Miller and Dollard (1941) presented a 
model for “imitative learning” which still stands today as a standard 
in social learning theory and research. The Miller and Dollard model 
states that the initial imitative act in imitative learning happens by 
chance and can only be reinforced if some drive is reduced after the 
execution of the response, which in turn strengthens the imitative response. 
This theory is primarily concerned with four factors: drive, response, cue, 
and reward. Miller and Dollard stated that three submechanisms account 
for imitation: (a) same behavior, (b) matched-dependent behavior, and 
(c) copying. 

Maccoby (1959) discussed imitation as instrumental learning in 
infant speech behavior. She suggested training in imitation by first 
imitating the child, then encouraging the child to do the same. All role 
taking, according to Maccoby, is imitative behavior; a child acquires 
imitative behavior from two sources—those who control the resources 
the child needs and those who are in close contact with him. She added 
that much imitative learning is covert, an important issue which I discuss 
later in the section on “vicarious learning.” 

No discussion on the topic of imitation learning would be complete 
without careful consideration of the work of Albert Bandura and his 
associates. Bandura probably published more research on this topic than 
any other single author. Most of his work was done with social behavior 
and with children as subjects. Two articles (Bandura, Ross & Ross, 1961; 
1963a) about the transmission of aggression in children via film-mediated 
models have become classics in the social learning literature. In both 
of these studies the authors showed that aggressive behavior could be 
transmitted to children through filmed models and that this type of 
behavior was generalized to situations other than the modeled situation. 
The Ss imitated not only the aggressive behavior, but also the verbal 
behavior and physical mannerisms of the model. 

In a study investigating the influence of reinforcement contingencies 
(Bandura, 1965b), the consequence of modeled aggressive behavior was 
shown to be a determinant of performance. Punishment consequences 
decreased the observer’s performance; reward or lack of consequence in- 
creased imitation. After each group of Ss was exposed to one of the three 
modeled consequences, the three groups were offered strong incentives 
for performing the modeled aggressive behavior. Incentives removed 
differences in performance among the groups. These results suggested 
to Bandura that a difference exists between the acquisition of behavior and 
the performance of behavior. Acquisition may occur due to contiguity 
(association), but performance may occur as a result of reinforcements 
administered to the performer. 

Bandura and Mischel (1965) used live and symbolic (verbal) models 
to modify “delay-in-reward” behavior. Ss who showed a preference for 
immediate but smaller rewards learned to delay reward in order to receive 
larger rewards. Both types of models (live and symbolic) were effective 
in altering behavior; however, in a one-month follow-up those Ss exposed 
to a symbolic model showed less retention and generalization of delay-in- 
reward behavior than those exposed to a live model. 

In an earlier study on the effects of reward, Bandura, Ross and Ross 
(1963b) showed that the behavior of a successful model was imitated and 
generalized, but the behavior of an unsuccessful model was not. The 
punishment or reward for aggressive modeled behavior determined the 
amount of imitative learning. The authors used the concept of vicarious 
reinforcement in suggesting that the observer acquires conditioned emotional 
responses even though he receives no aversive stimulation himself. The 
multiplicity of terminology and concepts employed by social learning 
researchers requires the reader’s close attention. 

There are several comprehensive review publications available. In 
a paper written for the Nebraska Symposium on Motivation, Bandura 
(1962) discussed the development of imitation learning and his own re- 
search to that date. Bandura and Walters (1963) related imitation to the 
general areas of social learning and personality development. In a chapter 
written for a text on behavior modification, Bandura (1965a) reviewed 
his research findings and outlined his theory of imitation. In that chapter, 
he introduced his concept of no-trial learning and three effects of modeling 
influences: (a) the modeling effect, (b) the inhibitory and disinhibitory 
effects, and (c) the response facilitation effect. The first effect concerns 
the acquisition of new responses, the second concerns the strengthening 
or weakening of inhibitory responses already possessed by the observer, and 
the third concerns eliciting previously learned responses which resemble 
the responses of the model. In a review article Wodtke and Brown (1967) 
discussed the general topics of social reinforcement, observational learning 
or imitation, and learning of aggression. Also included in this review were 
sections on the effects of social isolation on reinforcer effectiveness, parents 
and peers as social reinforcing agents, self-reinforcement, and social punish- 
ment. An excellent review of research on imitative behavior by Flanders 
(1968) provides a systematic survey of the experimental and theoretical 
literature on imitation. This review analyzed experimental designs used 
in recent imitation research and provided a discussion of the variables 
usually encountered in such studies. Particular attention was paid to the 
works of Bandura et al. A text by Bandura (1969a) on Principles of 
Behavior Modification and a chapter in a text on socialization theory 
(Bandura, 1969b) provide perhaps the most recent explications of that 
author’s views on imitation. 


Bandura’s more recent writings bear closer scrutiny. In a chapter 
entitled “Modeling and Vicarious Processes,” Bandura (1969a) reviewed 
imitation theory and described his own current theoretical orientation. 
As before, he maintained his dualistic stance in imitation learning by 
separating acquisition and performance into two discrete events. This 
accompanied a rejection of operant analyses of imitation because of their 
dependency on reinforcement. Bandura firmly maintains that performance 
alone is dependent on reinforcement and that acquisition may occur through 
observation of modeled behavior which does not involve reinforcement 
administered to the model or to the observer. As further justification of 
the dichotomy between acquisition and performance, Bandura (1969a, p. 
133) cited acquisition or observational learning as involving “. . . two 
representational systems—an imaginal and a verbal one.” Through these 
systems, stimuli are “ . . . coded into images or words for memory repre- 
sentation and function as mediators for subsequent response retrieval and 
reproduction.” Performance is the subsequent transformation of these symbolic 
codifications into motor behavior. Imitative learning according to this 
theory can occur without any behavioral evidence (no-trial learning). 

In another paper, Bandura (1969b) discussed imitation in the context 
of a “social-learning theory of identificatory processes.” 


Vicarious Learning 


Early research on vicarious learning dealt with the generalization 
of reinforcing effects from a model to an observer (Hill, 1960). This usage 
does not involve reinforcers delivered to the observer, only those to the 
model. This definition, however, is somewhat limited in light of more con- 
temporary usage. Early definitions such as Hill’s would better apply to 
vicarious reinforcement than to vicarious learning in general. The concept 
of vicarious reinforcement has itself been the subject of much research. 

Sechrest (1961) investigated the effect of a child observing verbal 
conditioning being administered to a model. Both positive and negative 
reinforcement were administered to the model; results showed that negative 
vicarious reinforcement produced a significant improvement in time scores 
on a subsequent puzzle task. Positive vicarious reinforcers, however, pro- 
duced no significant improvement on the same task. 

The effects of amount and schedule of vicarious reinforcement were 
also investigated. Smith and Marston (1965) used a tape-recorded model 
in which the delivery of responses in a verbal response class was manipulat- 
ed. The verbal reinforcer of “good” was issued at each response in the 
response class by the model Ss. The amount of responding in the response 
class by the model (and the amount of reinforcement by the Model E) varied 
from 20% to 80% of the total number of responses. The mean number of 
critical responses increased as a function of word-class size and word fre- 
quency. The word-classes, in order of size, were animals, weapons, and 
fabrics. The degree of vicarious reinforcement was found to be a function 
of the size of the word-class being reinforced. 

Lewis and Duncan (1958) conducted a study to determine if the 
observation of partial reinforcement would yield partial vicarious rein- 
forcement. This was done using a slot machine lever-pulling task which 
yielded token rewards. Both of the S groups (those who watched the model’s 
performance and those who simply heard verbal descriptions of the model’s 
performance) had as much success in acquisition as the third group which 
actually operated the machine in acquisition. All three groups took the same 
amount of time in extinction. 

Resistance to extinction is an important issue in any analysis of 
the effects of conditioning. Marston (1964) investigated vicarious extinction 
after vicarious acquisition. In the study reported by Marston, tape-recorded 
models were verbally reinforced for human nouns with “good.” Reinforcement 
was 100%. Ss underwent 10 operant trials, 50 vicarious acquisition 
trials and 50 vicarious extinction trials. Two variables were examined 
during vicarious extinction: (a) the effect of high vs. low rate of correct 
responses during observation of extinction tapes and (b) vicarious rein- 
forcement after responses, i.e., during extinction, does the reinforcement 
or nonreinforcement of the decreasing number of responses have any differ- 
ential effect? Results showed that vicarious extinction after acquisition was 
related to the rate of observed correct responses, but not to the continuance 
of vicarious reinforcement during extinction responses. The response rates 
after extinction for all Ss were above operant level. 

There is a considerable amount of evidence to support the contention 
that human behavior can be modified through the direct application of 
reinforcement to behavior. There is also a growing amount of data to 
support the thesis that reinforcement may not have to be direct to the 
learner; it may have the same effect on the learner if he merely observes 
reinforcement being given to another. The comparative dynamics of the two 
methods of reinforcement (direct vs. vicarious) is a logical point of interest. 
The effects of these two methods received some attention in recent research. 
Kanfer and Marston (1963) simulated a group situation on a tape record- 
ing. Verbal conditioning was effected by both vicarious reinforcement (VR) 
and direct reinforcement (DR). VR and DR both resulted in significant 
learning, but DR added nothing to VR. Both were equally effective and 
differences in extinction were attributable to individual differences in 
acquisition only. 

Kanfer (1965) summarized the results of his vicarious learning 
studies and concluded: (a) “observation of another person’s behavior and 
subsequent reinforcement coupled with the subject’s covert rehearsal of 
the behavior may represent an important way in which human beings 
can learn” and (b) “It may be possible for the human to reinforce his 
own behavior in the absence of external feedback.” This last conclusion 
implies that incorrect behavior may be self-reinforcing as in the case 
of deviant or pathological behavior. 

Phillips (1968a & 1968b) further investigated the Kanfer and Marston 
(1963) conclusions and provided interesting and conflicting results on 
the VR-DR question. Phillips (1968b) attempted to compensate for a 
weakness in the Kanfer and Marston study by adding a control group 
which experienced the model but received neither direct nor vicarious 
reinforcement. He used the critical response class of human words and 
used tape-recorded models to compare VR, DR, and no reinforcement (NR). 
The results showed that Ss in the two experimental groups and in the 
control group (NR) all increased significantly in critical responses above 
the base rate. There were no significant differences in acquisition between 
the VR, DR, and NR groups. This study actually consisted of two parts; 
in the first, a 30% reinforcement schedule was used and in the second, a 
60% reinforcement schedule was used. Differences in reinforcement 
schedules had no differential effects on response rates. Phillips concluded 
that reinforcement (VR and DR) is not responsible for increases in response 
rates but that increases may be brought about strictly through exposure to a 
nonreinforced model. These findings and their resultant implications are 
quite incompatible with other available research evidence. These results 
certainly contradict the Kanfer and Marston (1963) findings and also 
conflict with Bandura’s theory in that “performance” of imitative behavior 
is generally assumed to be contingent on that behavior being reinforced. 
Phillips (1968a) also compared VR, DR, and NR in another study. A 
ten-trial base rate was taken for each subject before one of the three 
treatments was initiated. The results of this study showed that DR was 
more effective in increasing response rates than was VR or NR. The use 
of base rate instead of the NR control group method seems to be largely 
responsible for the discrepancy between this and the previous study. 


Self-Reinforcement 


Self-reinforcement is generally concerned with the self-management 
of reinforcement contingencies administered to oneself by oneself. Kanfer 
(1965) used this term to explain what the observer in imitation does 
when he vicariously learns. The observer covertly says “good, I guessed 
it,” etc. Kanfer believes humans can reinforce their own behavior without 
the use of external reinforcement. 

Bandura and his associates also worked with this variable. Bandura 
and Whalen (1966) exposed children to two types of models, one with high 
criterion for self-reward and another with low criterion for self-reward. 
Results showed that the children selected the reinforcement contingencies 
commensurate with their own achievements. To the authors this indicated 
that children do reinforce themselves and have specific criteria for doing so. 
Bandura and Kupers (1964) used a similar design and found that self- 
rt contingencies for self-reinforcement were stronger than external 
stimuli. 

Bandura, Grusec, & Menlove (1967) investigated social determinants 
of self-reinforcement when Ss were exposed to high and low nurturance 
adult models with conflicting or high standards for self-reward. Results 
showed that Ss who had observed models who showed conflicting standards 
of self-reward were more inclined to reward themselves for low achieve- 
ments than those who had observed only the adult model with high 
standards. Ss who experienced high nurturance were more inclined to 
accept lower standards. In another study on the effects of high vs. low 
nurturance, Madsen (1968) found no differences in imitation as a result 
of the two conditions. 

During the last three years numerous studies appeared in which the 
effects of a host of independent variables on self-rewarding behavior were 
investigated. A sample of these variables includes effects of rule and 
structure (Liebert & Allen, 1967); incentive level and method of trans- 
mission (Liebert & Ora, 1968); rehearsal trials (Kanfer & Duerfeldt, 1967); 
types of prior external reinforcement (Kanfer & Duerfeldt, 1968); and the 
relationship between self-reinforcement and reinforcement of another person 
(Marston, 1965; Marston & Smith, 1968). 


Variations in Imitation Research 
Models 


Surprisingly enough, the comparative effects of different types of 
models have not received a significant amount of investigation in imitation 
research. The kind of model used would not seem to have a differential 
effect on outcome since many types of models have been used with relatively 
comparable results. These include live, film-mediated, tape-recorded, 
verbal, and other models. 


Bandura, Ross, and Ross (1963b) and McBrearty, Marston, and Kanfer 
(1961) have used live models with somewhat comparable success; live 
groups were used in the latter study. Bandura and Mischel (1965) found 
that Ss exposed to symbolic models showed less retention and generalization 
than those Ss exposed to a live model. 


Filmed models were extensively used by several authors. Bandura and 
his associates frequently used this medium (Bandura, Ross & Ross, 1963a) 
in the modeling of aggression for child subjects. Walters, Llewellyn-Thomas, 
and Acker (1962) also used film to model aggressive behavior. Walters, 
Leat, and Mezei (1963) conditioned resistance to temptation in children 
through the use of filmed models with reward and punishment response 
consequences. Although the majority of the models used in imitation 
research have been adult models, the use of film has not been limited to 
adults. Hicks (1965) used adult and peer models with differential results. 
Peers were found to have the most immediate effects; however, in a follow- 
up study, adults were found to have the most lasting effects on behavior 
change. In a slight variation from the film medium, Varenhorst (1964) 
used video-tape to investigate the effects of counselor traits in a “reinforce- 
ment counseling” setting. 

A model medium which received much attention in imitation research 
is that of the tape-recorded model. This has probably been due to its 
ease of construction and relatively low cost in comparison to filmed and 
other models. Another contributor to its popularity is the emphasis on 
verbal conditioning in some behavior modification research. Kanfer and 
Marston (Kanfer, 1965; Marston, 1964; Kanfer & Marston, 1963) made 
extensive use of tape recordings, as did Phillips (1968a & 1968b). Wheeler 
and Caggiula (1966) also used tape recordings in their work with behavior 
contagion. All of these studies show tape recordings to be a useful and 
effective medium for model presentation. In studies using operant approaches 
(imitation reinforcement) to imitation learning (Sherman, 1965; Lovaas, 
1967; Peterson, 1968) the model is usually live since the experimenter 
often serves as the model and reinforcing agent. 


Studies which deal with model characteristics other than behavioral 
are rare in the literature, with the exception of those dealing with the 
sex of models. Krumboltz, Varenhorst, and Thoresen (1967) investigated 
nonverbal factors in models with information seeking behavior (ISB) as the 
dependent variable. In this study, models (model counselors) high in atten- 
tiveness and prestige were compared with models who were low in 
attentiveness and prestige. Attentiveness was defined by particular manner- 
isms (tone of voice, mode of expression, smiling) usually defined as attend- 
ing behaviors. The prestige of models was defined by the amount of 
training the counselor/model allegedly had, his style of dress, and the 
prestigiousness of the introduction of the S to the model (models were tape 
recorded). Results of this study indicated that high and low attentiveness 
and prestige had no differential effects on outcome. Models in general 
produced a greater frequency and variety of ISB than control procedures. 
Thoresen and Krumboltz (1968) investigated the differential effects of 
models exhibiting varying degrees of athletic ability and academic success on 
the acquisition of ISB. Results showed that different levels of athletic 
ability produced differential levels of ISB; however, models of varying 
academic success did not produce differential acquisition of ISB. 


Flanders (1968) provided a review of research concerned with “effects 
of antecedent characteristics” of models including effects of status, nurtur- 
ance, sex, realism of performance, affective relationship between model and 
observer, and effects of antecedent characteristics of observer's sex. 


Anxiety 


Haner and Whitney (1960) in their conditioning of Galvanic Skin 
Response (GSR) in human Ss gave a pretest (Taylor MAS) to determine 
the anxiety level of their Ss. Data showed that there was a positive relation- 
ship between the amount of measured anxiety and the amount of condition- 
ing. Walters, Marshall, and Shooter (1960) also found some evidence that 
moderate emotional arousal may increase imitation. Ss in this study were 
brought to high anxiety states by being put in isolation prior to treatment. 
In a study examining the effects of physiological emotional states on 
imitative behavior (Schachter & Singer, 1962), injections of adrenalin 
were administered to Ss to induce emotional states. These authors also 
found that emotional arousal increased the amount of imitative behavior. 
Ss even imitated behavior which was inappropriate for their age. Bandura 
and Rosenthal (1966) found highly significant results in classically con- 
ditioning GSR through modeling. The S observed the model perform a 
pursuit rotor task during which a buzzer sounded and the model feigned 
receiving a shock. Observers witnessed ten presentations of the buzzer 
(conditioned stimulus) followed by the feigned experiencing of pain by the 
model. Observers were then given ten extinction trials on the pursuit rotor 
task with presentation of the CS alone. This treatment succeeded in trans- 
mitting anxiety to observers as measured by GSR. Walters and Amoroso 
(1967), in a study on the cognitive and emotional determinants of imitative 
behavior, used two levels of white noise administered during observation 
to investigate the arousal variable. Results indicated that no significant 
effects could be attributed to differential arousal levels. 


Self-Esteem 


Some research has been done (deCharms & Rosenbaum, 1960; Gelfand, 
1962) which indicates that amount of self-esteem is related to an observer’s 
susceptibility to imitation. These studies showed that Ss with less self- 
esteem were more easily conditioned and more prone to imitative behavior. 
Rosenbaum and deCharms (1960) “vicariously instigated” hostility in Ss 
through a tape recorded mock group situation. Some Ss were given an 
opportunity to retaliate directly or hear retaliation. In a posttest for residual 
hostility, low self-esteem Ss in the third group (no retaliation) showed 
greater residual hostility. Bandura and Walters (1963) discussed the 
influence of reinforcement history on observer characteristics in imitation. 
These authors stated that persons who lack self-esteem, are incompetent, 
have reinforcement histories for matching responses, or are dependent are 
especially prone to imitate a successful model. Lanzetta and Kanareff (1959) 
also showed “incompetents” to be more prone to imitate models. 

Dependency, as a personality characteristic, has much in common with 
anxiety and lack of self-esteem as it affects the outcomes of imitation 
research. Studies showing both a positive relationship between dependency 
and imitation (Bandura & Huston, 1961; Jakubczak & Walters, 1951) and 
no relationship (Marlowe et al., 1964) are in evidence. 


Sex 


The sex of the model as it affects imitation and the interaction 
effect of model and learner sex have received a considerable amount of 
attention in very recent research, especially in counseling applications. 
Bandura, Ross and Ross (1963a) noticed significant interaction effects 
attributable to the sex of model and learner. In these studies male (children) 
Ss showed more imitative aggression than female. Male models were more 
effective for both male and female Ss than were female models. Female Ss 
with female models showed a great deal of partial imitative responses; 
however, male Ss with male models tended to imitate more of the total 
modeled aggressive behavior. These authors felt that the influence of 
models in promoting “social learning” is determined, in part, by the sex 
appropriateness of the model’s behavior. In general, however, females 
seem to perform fewer imitative responses, even with positive incentive as an 
added inducement. 


Rosenblith (1959) also found male models to be more effective than 
female. She postulated that the school setting (Ss were children) was 
responsible for this effect because it set up a deprivation in respect to 
male relationships and enhanced the reward value of males. Hicks (1965), 
also working with aggressive behavior in children, found that male peer and 
male adult models had more effect in shaping behavior. 


Schroeder (1964) and Thoresen (1964) found the effects for sex 
quite significant. In both of these studies using male models, male Ss 
had higher imitative response rates than female Ss. Krumboltz and Thoresen 
(1964) and Krumboltz and Schroeder (1965) further stressed that males 
and females responded differently to imitation treatments. This was also 
found to be true in group modeling situations. The authors suggested 
that the reason for the ineffectiveness of the male model for females may 
have been that the male modeled only male interests and concerns which 
may not have seemed relevant to females. They further suggested that the 
model may be more effective if it is more specific to the concerns and the 
sex of the S. In an attempt to further investigate the sex factor, Thoresen, 
Krumboltz and Varenhorst (1965) conducted a study aimed specifically at 
the sex variable. Four factors were investigated in a quasi-counseling 
situation: sex of the student (S), sex of the counselor, sex of the model 
student, and sex of the model counselor. A follow-up was conducted three 
weeks after treatment to determine the variety and frequency of after- 
treatment responding. The main effect for sex was significant at the .05 
level for variety of responses and approached the .10 level for frequency 
of responses. For male Ss the male student and the male counselor model 
were most effective. In general, female Ss did not perform as many after- 
treatment responses as did males. Male counselors were generally more 
effective. Thoresen, Krumboltz, and Varenhorst (1967) examined the 
effects of sex of counselor and model on client career exploration. Male and 
female counselors presented male and female models to male and female 
student clients. Models were fifteen-minute audio-tape presentations using 
frequency and variety of external information-seeking behavior as de- 
pendent variables. Findings were that (a) male observers responded best 
when males were in all other roles (counselor, model counselor, and model 
student) and (b) female students responded best when the counselor was 
male and when the model was either all males or all females. In general, 
females did not respond to model reinforcement counseling as well as males, 
and in all cases male counselors seemed to be more effective. Part of the 
confusion on the sex variable seems to stem from the fact that studies such 
as those cited above are increased in complexity by an interaction between 
the sex of the subjects and the behaviors being modeled, which are usually 
sex typed. In such studies it is difficult to determine whether differential 
results are actually due to the sex of the subject or some other variable 
which derives from the sex of the subject (e.g., ability of the subject to 
identify with sexual role and status of the model). In studies where this 
interaction between the sex of the subject and the sex role implications of 
the model behavior is held to a minimum, imitation does not 
seem to differ between the sexes (Rickard & Joubert, 1968). 


Operant Analysis of Imitation 


Most of the theory and research thus far reported was generated from 
social psychology or social learning sources and constituted either an 
instrumental learning interpretation of psychoanalytic concepts (Miller 
& Dollard, 1941) or a mediational interpretation relying heavily on vicarious 
processes (Bandura, 1969a & b); the exception is Mowrer’s (1960) combin- 
ation of classical conditioning and vicarious mediational process. The 
operant or Skinnerian researchers were somewhat slow in arriving on the 
imitation research scene. Operant studies on imitation have begun to 
appear, but an operant theory of imitative learning is not readily visible. 
Developments in this area seem promising and should be interesting to 
observe. 


Bandura (1969a & b) consistently criticized the operant analysis of 
imitation on the basis that it constituted a shaping procedure of unreason- 
able duration which would endanger the learner in many critical social 
learning situations. This criticism seems to imply that in every case of 
imitation learning under an operant analysis, the observer would have to 
be “shaped” into performing imitation itself and then further shaped into 
imitating the behavior in question by reinforcement of successive imitated 
approximations to that behavior. The reply to this analysis seems to be that 
imitation represents a class of behavior that is conditioned from birth in 
many cultures, including the American culture (Skinner, 1953). The 
nature of imitation is such that gross units of behavior can be learned, 
thus obviating the need for shaping from existent behavioral repertoires 
as is necessary in other operant procedures. Because reinforcement histories 
for imitation do exist in humans in various degrees at different age and 
developmental levels, each human has some propensity or probability for 
imitative responding. 

That reinforcement histories for imitation are relevant to research on 
imitation can be easily documented in the social learning context as 
well. This can be seen in research on the effects of “set” (O’Connell, 
1965; Patterson, Littman & Brown, 1968), “nurturance” (Madsen, 1968), 
and “prior external reinforcement” (Kanfer & Duerfeldt, 1968) as well as 
in the works of Bandura and his associates (Bandura & Whalen, 1966). 
Bandura, Kanfer, and others noted that imitation appears to occur even 
when direct reinforcement to the model or the observer is absent. To 
explain this phenomenon, a host of mediational variables have been gen- 
erated including such hypotheses as symbolic and verbal coding (Bandura, 
Grusec & Menlove, 1966). It seems a reasonable alternative to suggest 
that such instances of nonreinforced imitation could be attributable to 
a reinforcement history for imitation which has considerable strength 
and does not require a continuous reinforcement schedule. 


Sherman’s (1965) study was based on previous research on rein- 
forcement methods with psychotics (Sherman, 1963) and imitation reinforce- 
ment with children (Baer & Sherman, 1964); he used an imitation re- 
inforcement procedure to establish verbal behavior in a psychotic mental 
patient who had been mute for more than twenty years. Sherman used three mute 
psychotics as subjects, two of whom responded to shaping and fading 
treatments alone. For the third subject, verbal behavior was instituted 
through the shaping of imitative responses. This was accomplished through 
the experimenter’s modeling of desired responses and directly rewarding 
the subject with food and social reinforcers for imitation (of the E). This 
procedure also elicited verbal responses from the S that were similar to 
those being modeled, but which were not actually in the critical response 
class modeled by the experimenter; they were, however, reinforced when 
emitted. After a verbal repertoire was established through imitation rein- 
forcement, shaping and fading procedures were applied, as with the other 
two Ss, to generalize responses and broaden the stimuli to which the subject 
would respond. 


Similar procedures were employed using mentally retarded children as 
subjects (Baer, Peterson & Sherman, 1967; Lovaas, 1967; Peterson, 1968) 
with similar success. These studies all indicate that imitation may be con- 
trolled through the use of reinforcement contingencies. 


Generalized Imitation 


Several studies on generalized imitation followed the same general 
design, at least in part. The investigators principally used child subjects 
including preschool children (Baer & Sherman, 1964; Brigham & Sherman, 
1968), retarded children (Baer, Peterson & Sherman, 1967; Peterson, 1968) 
and schizophrenic children (Lovaas, Berberich, Perloff & Schaeffer, 1966). 
In most of these studies the E served as the model and the response class 
was a verbal one including English, Russian, and Norwegian words. The 
general design of these experiments was to condition words by having the 
model say them and directly reinforce (food, candy, and social reinforce- 
ments) the observer for imitation. Other words, for which reinforcements 
were not given when imitated, were interspersed in the modeling procedure. 
In all cases it was found that those words which were modeled, but not 
reinforced if imitated, were also imitated, and they increased in frequency 
as did the reinforced words. After the modeled verbal behaviors were 
established in the observers, reinforcement contingencies were changed 
or extinction procedures were applied. This resulted in a decrease in 
imitation of both the reinforced and nonreinforced words. Reinstitution 
of reinforcement contingencies reestablished imitation of both reinforced 
and nonreinforced modeled words. This finding indicates that both word 
classes were under the control of the reinforcements administered to one 
class due to a generalized imitation effect. The same principle seems to 
apply to simple motor behaviors (Baer, Peterson & Sherman, 1967). The 
major implication from these studies is for the identification of response 
classes. A response class may be narrowly defined and limited by its 
topography or it may be broadly defined to include behaviors which are 
topographically different but in some other way similar. The latter definition 
would seem to apply to the studies cited above in which English and Russian 
words, for instance, were considered as two separate response classes; they 
were probably both members of one response class more appropriately 
defined as “words.” 


Imitation, Counseling, and Therapy 


Although the therapist has often been thought of as a source of 
reinforcements, little attention has been paid to the therapist as a source 
of behavioral repertoires (Bandura, 1965a). Most research and concern 
in this direction occurred in the past three to four years. Kanfer (1965) 
suggested that the therapist may act as a model for the patient or as a “self- 
stimulator” to the patient. Mowrer (1966) also discussed the therapist as 
a model and was probably a little more outspoken than many in saying 
that the therapist must provide a model for the client and encourage 
imitation of that model by the client. Marmor (1962) showed that the 
patient in therapy does model his behavior after that of the therapist. He 
suggested that the patient uses the therapist as a source of reinforcing 
stimuli by covertly asking himself “what would you want me to do?” 

Discussions of models in counseling, as in therapy, take two directions. 
The first direction is toward the counselor as a model. This becomes an 
ethical issue involved with how much the counselor should project himself 
into the counseling relationship and to what degree he should set himself up 
as a behavioral model for another human being. The second direction is 
toward the use of imitation as a treatment mode and is concerned with the 
degree to which desired counselee behaviors can be acquired through the 
use of real or symbolic models. 


Shoben (1965) perceived the counselor as a model of the effective 
and self-fulfilling person and of generalized human decency. He further 
wrote that during counseling the client acquires a new ego-ideal or ideal 
self-concept which becomes a goal in counseling. Shoben acknowledged 
the moral implications of such a concept of counseling by saying that it 
would require that the counselor or therapist know himself well enough 
to avoid becoming a “demagogue” and to be worthy of accepting the 
responsibilities of his task. 


Krumboltz (1965, p. 19) emphasized ends over means by defining 
counseling as “whatever ethical activities a counselor undertakes in an 
effort to help the client engage in those types of behavior which will lead 
to a resolution of the client’s problems”—the client is “requesting ends, 
not means.” 


Counseling in School Settings 


The preponderance of research in the application of imitation theory 
to the counseling process came from John Krumboltz and his students. 
Four studies of considerable significance (Thoresen, 1964; Krumboltz & 
Thoresen, 1964; Schroeder, 1964; Krumboltz & Schroeder, 1965) are all 
variations of the same experimental design and use the same rationale 
and concepts. These will be discussed collectively and variations noted 
where necessary. Two main treatments were used in these studies: (a) 
“reinforcement counseling” consisting of a verbal conditioning situation 
in which agreement and approval were administered (verbally and by 
gesture) as reinforcement for performance of the desired verbal behavior 
and (b) “model-reinforcement counseling” in which the same verbal 
conditioning treatment was administered with the addition of two 15- 
minute tape-recorded model counseling sessions in which the same verbal 
response class was reinforced in the interaction between the model S and 
the model counselor. The dependent variable was information-seeking behavior 
(ISB) which includes educational or vocational exploration behavior of 
any type. ISB may include verbalizations about the use of tests, interviews 
with occupational representatives in the field, the whereabouts of occupa- 
tional information, the nature of college programs, and so on. Two 
50-minute sessions were held with each S, one week (or two weeks) apart. 
At the end of the first session a summary of ISB emitted by the S in the 
session was given by the counselor. At the onset of the second session the S 
was asked what ISB he had performed since the first session and this was 
also verbally reinforced by the counselor. Cues, in the form of questions and 
suggestions, were supplied by the counselor in both treatments. Under the 
guise of an unrelated classroom task, questionnaires were filled out by Ss 
several weeks after treatment. The questionnaires gave information on the 
amount of actual ISB the Ss had performed subsequent to treatment. This 
ISB was labeled “external ISB” as opposed to “internal ISB” which was 
verbal ISB emitted during the two treatment sessions. One active and one 
inactive control group were used. Thoresen (1964) and Krumboltz and 
Thoresen (1964) varied their studies by adding a group treatment to each 
of the two previously mentioned treatments. In all four studies results were 
quite similar. Both treatments were found to be more effective than either 
of the control groups. Both treatments, however, appeared to be equally 
effective for female Ss, whereas model-reinforcement counseling was 
more effective for male Ss. This was attributed to the fact that the models 
were male and the topic of the model counseling session centered around 
male concerns and interests. Female Ss in reinforcement counseling exceed- 
ed the control groups; males did not. Internal ISB was positively correlated 
with external ISB; however, internal ISB did not increase significantly above 
that of the control groups and external ISB did. In the two studies in which 
the investigators used group treatments, results showed group model- 
reinforcement counseling was more effective than individual model-rein- 
forcement counseling. 

Varenhorst (1964) also used ISB as a response class. She conducted 
a model-reinforcement counseling study designed to investigate the effect 
of attentiveness and prestige of counselors on their success in producing 
ISB. She found that a video model was effective in producing ISB. She also 
found that the Ss imitated the modeled behavior regardless of the amount 
of attentiveness or prestige exhibited by the counselor. 

Ryan also investigated model-reinforcement counseling; she used the 
counselor as the model instead of an external model (Ryan, 1965a & 1966). 
The counselor presented himself as a model through verbal “cues” which 
indicated the desired behavior. In this study, nonprofessional counselors 
(students) were used as counselors in a group setting after being briefed 
on counseling methodology. The response class (good study habits) was 
emitted more frequently. Results failed to reach significant levels, however. 
In a paper on the influence of cueing procedures on counseling, Ryan 
(1965b) suggested that counselors could be more effective by becoming more 
aware of the cueing procedures they use (consciously or unconsciously) and 
by selecting and using the cueing pattern most likely to produce the desired 
behaviors. 

Thoresen, Krumboltz, and Varenhorst (1967) undertook further 
exploration of the sex variable which appeared to be the root of differential 
results in some of the Krumboltz et al. studies previously mentioned. In 
the 1967 study, they investigated the sex effect of model counselors and 
students (tape recorded), the counselor presenting the model, and the Ss’ 
acquisition of ISB. The design and procedure were essentially the same as 
other ISB studies and included a three week follow-up. Findings revealed 
that (a) model-reinforcement, on the average, was more effective than 
control procedures for male Ss but not for female Ss, (b) male Ss 
responded best when the modeled counselor and student and the live 
counselor were all males, (c) female Ss responded best with a male coun- 
selor, but only when the model counselor and student were either all male 
or all female—mixed sex models were not as effective. 

Thoresen and Krumboltz (1967) investigated the effects of reinforce- 
ment within a host of treatment settings including individual model- 
reinforcement, group model-reinforcement, individual reinforcement, group 
reinforcement, and individual and group controls. Sex of Ss was evenly 
divided and the dependent variable was again ISB; procedures were 
essentially the same as in other ISB studies by Krumboltz et al. Subjective 
ratings by clients after the study indicated no significant association between 
amount of ISB and the degree to which subjects felt they had been helped 
by counseling. A study investigating nonverbal factors in reinforcement 
counseling (Krumboltz, Varenhorst & Thoresen, 1967) used two video- 
taped model observation treatments. Video-tapes portrayed rehearsed 
counseling sessions in which the model student emitted ISB responses 
and the model counselor gave reinforcements and cues for ISB. On one 
video-tape the model counselor showed high attentiveness by smiling at 
and facing the model student, nodding, indicating pleasure with the con- 
versation, indicating enthusiasm through her tone and expression, and not 
exhibiting any distracting mannerisms. On the second tape the model 
counselor showed low attentiveness by not smiling, using a flat tone of 
voice, seldom looking at the student, and generally exhibiting distracting 
mannerisms such as looking at her watch. The prestige of the model 
counselor was also manipulated in that the video-taped counselor was 
introduced to Ss either as (a) a counselor with a great deal of experience 
who was very helpful, or (b) a counselor who was just completing her 
training and was still in the process of learning how to counsel. An analysis 
of variance failed to produce any significant differences in ISB as a result 
of either the alleged prestige of the model counselor or the degree of 
counselor attentiveness. Treatment groups did, however, produce a greater 
frequency and variety of ISB than did either active or inactive control 
procedures. 

Investigating similarity of model and client, Thoresen and Krumboltz 
(1968) constructed two sets of audio-taped models. One set modeled 
students at three levels of athletic success; the second set modeled students 
at three levels of academic success. Both sets were constructed around 
discussions on vocational planning activities. Subjects were eleventh-grade 
males. Results showed that the success levels of the athletic models produced 
significant differences in frequency of ISB; high-success athletic models 
were the most effective for all students. Variations in academic success 
level, however, did not produce significant differences in ISB. 

It is unfortunate that the greater portion of the Krumboltz research 
was generated from the same design and used the same dependent variable 
(ISB) and that there was not more effort to use the model-reinforcement 
counseling technique to institute other behaviors relevant to the school 
counseling setting. There was little effort to investigate subject variables 
(aside from sex) by Krumboltz nor was there any attempt to use such 
techniques as “generalized imitation.” 

In general, very little research is being done on the effect of modeling 
procedures in clinical settings with normal subjects (or clients). There 
has, however, been some work using video-tapes for “self-confrontation” 
through playback in treatment (Danet, 1968). There were also some 
video-tape methods incorporated into hospital settings for the preparation 
of patients for treatment and for the training of psychiatric aides in 
behavioral techniques (Lee & Znachko, 1968). 


Treatment of Behavior Deficiencies 


The incorporation of imitation concepts into treatment modes seems 
to have accelerated recently in those areas which are concerned primarily 
with gross behavior deficiencies and to some extent neurotic fear reactions 
(phobias). Research on clinical applications of imitation was focused 
primarily on child subjects and was conducted frequently in institutional 
settings. 

Bernal, Duryee, Pruett, and Burns (1968) conducted an interesting 
behavior modification program aimed at remission of severe disciplinary 
problems in an eight-year-old emotionally disturbed boy. A chronic 
problem in the use of behavior modification techniques with disciplinary 
problems is the lack of complete environmental control in order to maintain 
contingencies necessary for modification of behavior. To gain some of 
this control and to insure that home environments do not work at cross 
purposes with treatment programs, parents are often incorporated into 
and aid in treatment. In the present study, the mother’s aid was enlisted 
by training her in step-by-step procedures for interacting with her son. 
Mother-son interactions were video-taped as the mother practiced in- 
structions given to her. These video-tapes were viewed with the experiment- 
ers for appraisal of success. This modeling method resulted in a rapid 
reduction of the S’s abusive behavior within a few weeks. Twenty-three 
weeks after initiation of the program positive effects were still evident. 
Sarason (1968) dealt with disciplinary problems in what might be con- 
sidered normal children; he used observational learning techniques in 
treating juvenile delinquents. 

Perhaps the most impressive use of imitation techniques with severely 
disturbed children was provided by Lovaas (1967) in his treatment of child- 
hood schizophrenia. Lovaas has been associated with a treatment program 
which used a host of operant methodologies, only one of which was imitation 
training. The Ss selected for this project were severely impaired autistic 
children, most of whom had undergone previous traditional treatment and 
institutionalization. Since these subjects had all shown little or no imitative 
behavior at the onset of treatment, it was necessary to condition them to 
imitate behavior (operant shaping procedures were attempted but had no 
effect). The initial segment of Lovaas’s “verbal imitation training” pro- 
ceeded as follows: (a) Ss were reinforced for all vocalizations and for 
looking at the E’s mouth; (b) acquisition of temporal discrimination—S was 
reinforced only if response was within six seconds of a model’s vocalization 
(reinforcers were food); (c) same as (b) but the S had to match the model’s 
verbalization. In this step, as in other steps, a prompting procedure was 
used in which the model (E) aided the response by physically manipulat- 
ing the S’s head or manipulating his mouth, etc. The words were those 
with distinct visual components such as the letter m or open mouth vowels 
such as a as well as sounds which were frequently emitted by the child; 
(d) repeat of (c) with new sounds interspersed with sounds which had 
already been conditioned. The aim of this procedure was to increase dis- 
crimination. 

A procedure similar to that used by Sherman (1965) was used to 
determine the controlling effect of reinforcement. Reinforcement was 
changed from a response contingent to a time contingent schedule; the 
change resulted in a decrease of critical responding. Reinstatement of the 
response contingent schedule resulted in reinstatement of imitative behavior, 
thereby indicating that responses were under the control of reinforcements 
being administered. After imitative responding to English words was 
attained, Norwegian words were included in the modeling sequences but not 
reinforced; this change produced the generalized imitation effect previously 
discussed in this review. 

The second phase of the imitation training program shifted away 
from control by individual models to control in a larger environmental 
context. This was carried out in three stages: (a) teaching Ss to identify 
and label common objects and behaviors, (b) increasing the use of abstract 
terms, and (c) eliciting the use of language in conversational speech. 

Nonverbal imitation was also conditioned in this program. Eighty 
tasks ranging from simple to complex were learned by essentially the 
same procedures as those used in verbal imitation, with the exception 
that Ss matched the model’s motor activity. Imitative tasks were in five 
areas: (a) personal hygiene and self-help, (b) games and following rules, 
(c) appropriate sex role behaviors, (d) drawing and printing, and (e) 
affectionate behavior. 

Although the generalization of verbal responses from those particular 
words imitatively conditioned was not what the investigators hoped it 
would be, that verbal behavior was established at all in these Ss appears 
to be a monumental accomplishment. 

Mental retardation. The mentally retarded child seems to be a 
likely subject for imitation reinforcement techniques. Peterson (1968), 
using one retarded child as his subject, conditioned a variety of imitative 
responses including motor responses. The S was taught to imitate a variety 
of modeled behaviors until training had reached a point where modeled 
behaviors were imitated on first presentation. The S was then able to 
maintain learned imitations without reinforcement when such imitations 
were interspersed among other new imitations which were reinforced 
(imitation generalization). The fact that a rather broad imitative response 
repertoire can come under the control of a relatively small number of 
reinforcement contingencies seems to have clear implications for training 
the retarded. The same processes were used by Baer, Peterson, and Sherman 
(1967) to establish generalized imitative behavior in three retarded Ss. 

Speech and language training. Using imitation reinforcement pro- 
cedures similar to those described above, Sherman (1965) was able to 
reinstate verbal behavior in a mute psychotic. The S was a 61-year-old 
woman who had been hospitalized for 37 years; she had exhibited 
complete mutism during 33 of those years. To establish a functional 
class of imitative behaviors, Sherman initiated an imitation reinforcement 
procedure; he began with the imitation of nonverbal responses. These 
nonverbal responses (standing up, sitting down, picking up a spoon, etc.) 
were behaviors under the verbal control of the E. The E modeled the 
behavior and accompanied it with a verbal command to perform the 
behavior. If the S imitated the behavior, the E rewarded the S with a bite 
of food and said “good”. If the S did not perform the imitative response, 
then the E sat down and read for one minute before resuming with the 
next trial. The nonverbal behaviors modeled by the E were gradually 
changed to behaviors which began to resemble verbal behaviors (e.g., 
blowing out a match, opening the mouth, and clearing the throat). After 
40 sessions the E was able to elicit a host of modeled responses from the S 
by requesting “do this.” The next step toward verbal imitation was to 
condition basic sounds such as hissing sounds, “ppph” and finally a long 
“ah” sound. By session 57 the E had imitatively conditioned the basic 
sounds necessary to chain together the word “food”. The S also began to 
consistently imitate other simple words modeled by the E. At this point 
verbal behavior at its basic level had been instituted; further treatment 
consisted of shaping and fading procedures to broaden the S’s verbal 
repertoire. 

Although the generalization to situations outside of the experiment 
still remains a problem in this type of procedure, the results, in light of 
the S’s clinical history, are quite dramatic. The procedures in this investi- 
gation are comparable to the procedures reported by Lovaas (1967) and 
Lovaas et al. (1966) in their work with schizophrenic or autistic children. 

Wilson and Walters (1966) compared model reinforced (operant), 
model nonreinforced (vicarious), and control procedures to modify speech 
in near-mute schizophrenics. Subjects were severely regressed and varied 
in age from 32 to 62 years. Pennies were used as reinforcers in the model- 
reinforcement treatment. After seven acquisition sessions both the 
model-reinforcement group and the model-nonreinforcement group showed 
increases in verbal behavior. Increases were greater in the model-reinforce- 
ment group. In session eight, the reinstatement of base line procedures 
resulted in a decrease in verbal responding in both the model-reinforce- 
ment and the model-nonreinforcement group; the greater decrease oc- 
curred in the model-reinforcement group. A follow-up showed that treatment 
effects did not generalize to the ward settings where the Ss were patients. 
The results of the study indicated that operant procedures show higher 
acquisition rates but that nonreinforced or vicarious procedures are more 
resistant to extinction. 

Avoidance behavior. Bandura and others conducted a series of studies 
incorporating desensitization (counter-conditioning) and modeling pro- 
cedures to extinguish avoidance or phobic behavior (Bandura, 1968). 
Bandura, Grusec, and Menlove (1967b) used 48 nursery school children 
and live models to attempt vicarious extinction of dog-avoidance behavior. 
All Ss had shown strong avoidance behavior on pretreatment tasks with 
a dog. Three treatment groups were used including: (a) a series of 
modeling sessions in which Ss, in a highly positive context, observed 
a model exhibit progressively stronger approach behaviors with a dog, (b) 
a series of modeling sessions in which the Ss observed the same modeled 
behaviors in a neutral context, and (c) a series of sessions in which Ss 
observed the dog in a positive context but in which no model performed 
approach behaviors. A posttest consisted of having Ss perform approach 
behaviors with the dog. Results showed that Ss in all treatments exhibited 
significantly more approach behavior toward the dog. Subjects in the 
modeling-positive context treatment showed significantly more approach 
behavior than the other two treatment groups. 


In a similar study, Bandura and Menlove (1968) used filmed models 
to explore the effects of diversity of modeled stimuli on the degree of 
vicarious extinction. Fear of dogs was the dependent variable and two 
treatment groups were tested: (a) a group of nursery school children 
observed a filmed model display progressively closer interactions with a 
single dog and (b) a group of nursery children observed a variety of models 
performing the same progression of interactions with a variety of dogs. 
Results of a one month follow-up revealed that Ss in the multiple model 
and dog treatment were able to perform dog-approach behavior to a much 
greater extent than were those in the single model and dog group. Both 
treatment groups, at the termination of treatment, showed a significant 
reduction in avoidance behavior. It seems logical to hypothesize that the 
multiple model treatment was more successful due to a greater availability 
of stimuli for stimulus generalization effects. In comparing the results of 
this study with the earlier study, Bandura, Grusec and Menlove (1967b) 
noted that the single symbolic model was less effective than the single 
live model used in the previous study. The authors suggested that treatment 
might be best accomplished by using either a single live model or multiple 
symbolic models. 


The procedures described above have much in common with desensi- 
tization procedures, since they involve the presentation of increasingly 
threatening stimuli to the subject and simultaneous maintenance of a 
relaxed state in the subject by presenting stimuli in highly positive (relaxed) 
context. Bandura, Blanchard, and Ritter (1969) compared desensitization 
procedures to vicarious extinction procedures (modeling) for producing 
affective and attitudinal change. Prior to treatment, measures of attitude 
were administered to male and female Ss from a set of attitude scales and a 
Semantic Differential technique. An avoidance behavior test was also 
administered in which a graded series of 29 performance tasks consisting 
of increasingly threatening interactions with a four-foot king snake was 
performed by Ss. As a final pretest, Ss were matched and randomly assigned 
to one of four conditions. One condition employed standard systematic 
desensitization using deep relaxation. A second condition consisted of self- 
administered symbolic modeling in which Ss observed a film showing 
increasingly more threatening interactions with a snake. The third condition 
employed graduated live modeling with guided participation. A control 
group constituted the fourth condition. 


Results revealed that all three treatment groups had significant reductions
in avoidance behavior when compared with the control group.
Symbolic modeling (with relaxation) and systematic desensitization proved
equally effective; the modeling with guided participation treatment (contact 
desensitization) showed the greatest reduction in snake avoidance behavior. 
During posttreatment trials a second, unfamiliar snake was introduced in 
order to obtain an indication of the generalization effects. These were also 
positive, but not for all Ss. Fifty-five percent of the guided participation Ss 
were able to generalize, and 100% of the symbolic modeling and 0% of the 
systematic desensitization Ss were able to generalize. The greater success 
of Ss in the symbolic modeling treatment was considered due to their 
viewing a wider variety of stimuli. Correlations between behavioral and 
attitudinal changes were moderately high. 


Other experimenters used variations of this procedure with varying 
success. Hill, Liebert, and Mott (1968) used filmed models to reduce dog 
phobias in preschool children: one modeling high fear of snakes and the 
other modeling low fear of snakes. High and low fear subjects were exposed 
to both conditions. Results were essentially positive for high fear Ss observing
low fear models. Other results were inconclusive due to design weaknesses.


Although comparatively little research was done on modeling pro- 
cedures for extinguishing aversive behavior, the few studies to date seem 
to be extremely promising for the development of a highly effective 
procedure, both with children and adults. The “guided participation” 
procedure described by Bandura, Blanchard and Ritter (1969) is essentially 
the same procedure described as contact desensitization (Ritter, 1968); the 
only difference is in the theoretical assumptions undergirding the two 
procedures. The development of further procedures along this line should 
help clarify the theoretical constructs involved. Differential theory to
support essentially identical treatment would seem to be an artificial state
of the art.


Group Counseling and Therapy 


When theories and techniques of counseling and therapy evolve, 
applications in group procedures for those theories usually follow weakly 
and slowly and become stepchildren to the singular modes. The area of 
behavior therapy is no exception to this, as the paucity of research in reviews 
of group behavior therapy reveals (Krumboltz, 1968). In the ISB studies 
by Krumboltz and his associates, group procedures were occasionally 
employed using the model-reinforcement paradigm (Krumboltz & Thoresen, 
1964). In these studies tape recordings which modeled the desired reinforce- 
ment contingency were usually employed. A counselor present in the group 
also dispensed reinforcements to the model and directly to group participants
for imitation. Ryan (1965a; 1966) also used model-reinforcement counseling
in group settings, but she used members within the group or the
counselor himself as the model. 

In an excellent study, Hansen, Niland, and Zani (1969) employed 
group model-reinforcement techniques in an elementary school setting. 
Using Gronlund’s Sociometric Test, high and low social acceptance students 
were identified within classrooms and distributed into two treatment 
groups: (a) model-reinforcement group counseling and (b) reinforcement 
group counseling (no model). In the model-reinforcement group treatment, 
sociometric “stars” (high acceptance) were used as models. Groups in 
this treatment consisted of six Ss with even distribution of stars, low 
social acceptance students, and sex; there was at least one model for 
each sex within each group. The reinforcement counseling treatment 
consisted of groups of six Ss. Eight sessions, two per week for four weeks, 
were conducted for each of the two treatment conditions. On a sociometric 
posttest Ss within the model-reinforcement group displayed significantly 
more gains than those Ss in the reinforcement group or the activity control 
group. A two month follow-up revealed that gains were essentially main- 
tained. Members of the reinforcement-only group showed no changes above 
controls after treatment and after follow-up. The results of this study are 
worth noting since the dependent measure is easily obtained in a school 
setting and the procedure could be easily carried out at the public school 
level with little training. 

Ritter (1968) used a group procedure to compare the effects of contact
desensitization and vicarious desensitization on the elimination of snake 
phobia. The procedures were similar to those used by Bandura, Blanchard 
and Ritter (1969) with the addition of multiple subject exposure. Vicarious 
desensitization constituted one condition in which five peer models were 
seated in one extremity of a room in which a snake was contained in a cage. 
Subjects were ushered into the room and the E removed the snake from 
the cage and began to pet it; he invited the models to do likewise. 

The second treatment was contact desensitization in which the E
sat in a room with a group of Ss and removed the snake from the cage and
began to pet it. The E continued to do this until one of the Ss expressed a
desire to participate. When this occurred Ss participated through a structured
sequence of increasingly intimate contacts with the snake, ranging
from stroking the snake with a glove on to petting the snake with bare
hands. Subjects in this group took turns being teacher and model. A control
group did not come into contact with snakes but did take assessment tests. 
All items on an avoidance test were successfully performed by 80% of the 
Ss receiving contact desensitization as compared to 53% of those receiving 
vicarious desensitization. None of the controls were able to complete the 
avoidance tests. Reductions in generalized fear were not significant for any 
of the groups.


Comment


The potential that imitation learning holds for practical application 
in therapy and counseling appears to be significant. Research efforts to 
date on this phenomenon have been relatively few. The multiplicity of 
terminology and concept and the apparent lack of communication between 
academic disciplines concerned with the concept are probably the best 
indicators of the infantile state of knowledge of imitative processes. The 
greatest theoretical gap lies between those following vicarious paradigms 
and those following the operant paradigm, with the operant researchers 
apparently inclined to be a little less theoretical. The lack
of theory in operant imitation is probably due to a belief that imitation is
simply another instance of operant conditioning. This may be true, but 
it is oversimplified. Although imitation may function operantly it appears 
to be more complex than the standard shaping procedure. This seems due, in 
part, to the fact that imitation usually deals with the acquisition of gross 
units of behavior within which more complex interactions exist. In general, 
there appear to be too many theories to explain the same behavioral
phenomena. 


The needs for further research are numerous and pertain to almost 
every aspect of imitation theory. Too many researchers have become fixated 
on the same experimental design using the same dependent variables. There 
is a need for generalization and variation if results are to be of any 
practical clinical use. This of course reflects the need for more work to 
be done in clinical settings and dealing with clinical variables. Some of the
clearest and most significant results have come from studies of pathological 
behavior. This should encourage further work with a wider variety of 
behaviors, especially using operant methods which seem to date to have 
the most significant clinical impact. 


The possible application of imitation to the fields of therapy and 
counseling represents an exciting potential. Those who feel a disdain for 
“hardware” approaches to therapy will probably find these possibilities 
distasteful. Such a person might, however, want to reconsider his judgment 
if imitative learning (with the therapist as the model) appears to be an 
inescapable factor in therapeutic relationships which, in fact, demands 
control or at least consideration. For those who do not feel averse to
instrumental approaches to therapy and counseling, the implications are 
probably more obvious.


Educational Testing Service


In this paper we present a discussion of the use of 
educational tests in guidance services as seen in the light of modern develop- 
ments in statistical theory and computer technology, and of the increasing 
demands for such services. A focus and vocabulary for this discussion is 
found in Turnbull’s recent article on “Relevance in Testing” (1968). Fol- 
lowing an introductory discussion of the need for guidance services, some 
very recent work in Bayesian inference is reviewed and the implications of 
this work for educational research methodology are noted. Special attention 
is given to the Lindley equations which provide solutions for a number of 
problems in the comparative prediction of academic achievement. We 
suggest that in a changing educational environment the Bayesian method- 
ology can provide an increase in the effectiveness of such programs as 
Horst’s monumental Washington Pre-College Testing Program. We see 
comparative prediction as an idea whose time has come. 


Some Vocabulary of Educational Technology 


Personnel problems may be grouped together in many ways, and the 
typical problem can usually be considered from more than one point of 
view. We shall be restricting ourselves here to those problems that can 
conveniently be viewed as problems in guidance and/or selection. Another 
very useful and contrasting grouping is that involving problems of classi- 
fication. The classification problem has recently been studied in depth by 
Rulon, Tiedeman, Tatsuoka and Langmuir (1967). Further references to 
this area may be found in that book. Wolfe’s (1969) review of that book 
gives some glimpse of the mathematically more sophisticated allocation 
methods now being used by the military services. An up-to-date review of 
Guidance and Counseling appears in the April 1969 issue of the Review 
of Educational Research. 

The standard classification problem involves a closed system of assign- 
ment of each member of a group to one of several subgroups or classifica- 
tions. These classifications are often defined by a subsequent training 
program, job assignment or, more generally, by subsequent treatments. The 
military services are typically concerned with a classification problem
whenever a group of recruits completes basic training. Each recruit must then
be assigned to one of a number of service schools or to one of a number of 
on-the-job training programs. If each of these schools and each of these 
programs is viewed as a classification, then this personnel problem may 
be viewed as one of classifying each of the recruits. There is usually a 
quota for each of the classifications which must be filled and a maximum 
number that may be assigned. Often these requirements are not totally 
inflexible so that some latitude of choice may be permitted to the individual. 
Whenever substantial choice is present the problem may be viewed in the 
context of the guidance-selection paradigm. 
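The closed-system character of the classification problem can be illustrated with a small sketch. The predicted-performance figures, the function name `classify_with_quotas`, and the exhaustive search are all assumptions introduced here for illustration; they are not drawn from the military procedures or the allocation methods cited above, which would rely on far more efficient techniques.

```python
from itertools import permutations


def classify_with_quotas(scores, quotas):
    """Assign each person to one classification, filling every quota exactly.

    scores[i][k] is a hypothetical predicted performance of person i in
    classification k; quotas[k] is the number of places in classification k.
    The search is exhaustive, so it suits only very small illustrative groups.
    """
    # One slot per available place, labelled by its classification index.
    slots = [k for k, q in enumerate(quotas) for _ in range(q)]
    if len(slots) != len(scores):
        raise ValueError("total quota must equal the number of persons")

    best_total, best_assignment = float("-inf"), None
    # Each distinct ordering of the slots is one complete assignment.
    for candidate in set(permutations(slots)):
        total = sum(scores[i][k] for i, k in enumerate(candidate))
        if total > best_total:
            best_total, best_assignment = total, candidate
    return list(best_assignment), best_total


# Hypothetical predictions for four recruits in two service schools.
scores = [
    [70, 60],
    [65, 64],
    [50, 40],
    [80, 79],
]
assignment, total = classify_with_quotas(scores, quotas=[2, 2])
print(assignment, total)
```

In this invented example the fourth recruit has the highest predicted score in the first school yet is assigned to the second, because the quotas select the assignment that is best in total rather than best for each person taken singly.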


Guidance and selection problems occur each year at the point of tran- 
sition for students from secondary school to university. The typical student 
will wish to consider entering one of a number of colleges, universities or 
other institutions for further education, and within each institution he 
will have a variety of programs from which to choose. Many factors will 
affect his final choice, and one of these will surely be his expectation of his 
potential success within each of the various programs. The guidance prob- 
lem with which we shall be concerned is that of developing statistical 
methods to enable the student’s high school, the sending system, to make 
accurate predictions of each student’s potential success in each college, 
university or other training program that he may be considering. Indeed 
an important task of any guidance service will be to suggest to students 
that they may be qualified to enroll in programs that they had not pre- 
viously considered. Such services can encourage potentially qualified stu- 
dents, whose backgrounds have not given them expectations of college 
attendance, to consider this alternative. Conversely, students with overly 
demanding parents can be warned away from high prestige programs for 
which they are not qualified. Predictions of performance will provide 
each student with one useful piece of information that will help him, 
with the assistance of his guidance counselor, to make an informed and 
rational decision. In the pure guidance problem, as described here, the 
student is free to enroll in any of the programs he may be considering. 
In practice this will not be true for most students. However, the statistical
methods developed for the pure guidance problem are equally valid 
when there are restrictions, provided only that a large measure of choice is 
left to most students. For example, a peacetime volunteer army might find
the guidance paradigm very useful, but a mobilization army would find
the classification paradigm more appropriate.


At the same time that the students and their counselors are concerned
with guidance, university admissions officers are concerned with the
problem of selecting the best possible entering freshman class. In a pure
selection problem it is assumed that there are more applicants than vacancies
and that each accepting system is free to take just those students that it
believes to be best qualified. The pure selection model (which in the
educational context might well be called an acceptance model) is often 
approximated rather well and the statistical methods developed for this 
paradigm will find equally wide application even in those situations when 
it is not, provided only that some positions exist for which multiple appli- 
cations have been received. Actually, in most instances the better students 
receive acceptances from many colleges so that no college can be sure of 
getting every student it selects. 


We shall use the term comparative guidance to describe any system 
of information transmittal designed to provide a student with information 
about two or more possible career opportunities. We shall use the term 
comparative prediction to describe any system of making predictions for 
two or more possible career opportunities. Horst’s techniques of multiple 
absolute prediction (1955) and multiple differential prediction (1954) are 
two important techniques useful in comparative prediction. 
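A minimal sketch of what a comparative prediction might look like in practice follows. It assumes, purely for illustration, that a separate linear prediction equation is already available for each program; the program names, test names, intercepts and weights are invented, and the sketch is not an implementation of Horst's techniques.

```python
def comparative_prediction(test_scores, program_equations):
    """Return one predicted criterion score per program under consideration.

    test_scores maps a test name to the student's score on it.
    program_equations maps a program name to (intercept, weights), where
    weights maps a test name to its hypothetical regression weight.
    """
    predictions = {}
    for program, (intercept, weights) in program_equations.items():
        predictions[program] = intercept + sum(
            weight * test_scores[test] for test, weight in weights.items()
        )
    return predictions


# Invented equations predicting first-year grade-point average.
program_equations = {
    "law": (0.5, {"verbal": 0.04, "quantitative": 0.01}),
    "business": (0.4, {"verbal": 0.02, "quantitative": 0.04}),
}
student = {"verbal": 60, "quantitative": 50}
predictions = comparative_prediction(student, program_equations)
print(predictions)
```

The output is a set of predictions to be laid side by side for the student and his counselor, not a single recommendation.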


Scientific Method and Humanistic Goals 


The important distinction between the guidance-selection and classi- 
fication paradigms is the degree of compulsion characterizing each system. 
The classification paradigm adopts a purely actuarial outline which, in the 
extreme, delegates to the computer irrevocably the task of assigning each 
person to an “optimal” treatment. The guidance-selection paradigm, how- 
ever, leaves the choice of college by the student and the choice of students 
by the college to a relatively unstructured but informed interactive process. 
In the extreme the classification paradigm is completely mechanistic. The 
guidance-selection paradigm, however, is fundamentally humanistic, yet it 
adopts a quantitative scientific approach to the greatest possible extent 
consistent with the realization of the aspirations of the largest possible num- 
ber of individuals and a degree of overall efficiency of selection from 
society’s point of view. 

When society adopts the formal classification model it is satisfied 
because assignments, on the average, are then good. The student, however, 
is unconcerned with such average good, but is concerned with whether or 
not his particular assignment is good. If he perceives that he belongs to 
some subgroup for which, on the average, poor assignment decisions are 
made, it will not comfort him to know that the system works well for al- 
most everyone else. His reaction to such a perception may well be to drop 
out of the system and this reaction may be determined more by perception 
than reality. 


It is thus essential, from the student’s point of view, that the overall 
personnel decision procedure visibly take into account not only the needs 
of society, which must reflect the needs of the individuals, but also, specifically,
the individual needs of its people. Thus personnel decisions must be
both efficient and fair. Cronbach and Gleser (1965, p. 137) have pointed 
out 


that an abstract conception of “justice” lies behind much of 
the concern [in testing] about error of measurement. An ability 
test is expected to rank persons from best to poorest, and error 
distorts the ranking. Since such distortion is “unfair” to the 
individuals who are ranked lower than they deserve, testers want 
to reduce errors of measurement. 


Reflecting the accepted point of view at the time of the first edition of their 
book (1955) they argued further that “from a utilitarian point of view, 
these errors can be ignored unless they alter the goodness of whatever 
decisions are to be made.” [Italics added.] 


The first part of Cronbach and Gleser’s statement is clear and must 
be accepted as an important contribution to our thinking about what 
constitutes both good tests and good testing procedures. The second part, 
however, must be interpreted in the light of the social and political develop- 
ments of the last decade, and as a result the last phrase (italicized) of this 
quotation must bear heavy emphasis. Recent developments have resulted 
in Cronbach’s more recent writing (Cronbach & Snow, 1969) indicating 
more specifically that in present day American society a more elaborate 
utility structure must be considered than has been in the past. The utili- 
tarian point of view from which Cronbach and Gleser’s remarks were 
interpreted was that of the testing organization and those responsible for 
the selection of students. It is not necessarily that of most examinees. It is 
now recognized that the student’s point of view must be considered more 
carefully than it was a decade ago. Coleman (1969), for example, eloquent- 
ly pleads for a symmetry principle in college choice. Raiffa (1968) readably 
discusses Arrow’s (1951) work showing possible inconsistencies between 
group optimality and individual person optimality. 

It is generally accepted that the utility of a procedure will be an in- 
creasing function of its overall mean effectiveness. We may also feel, 
however, that its utility will be lessened if its effectiveness is very low for 
certain recognizable subgroups. If so, then the concept of fairness becomes 
a component of utility and cannot be ignored. A procedure that is man- 
ifestly and grossly unfair to any subgroup of people will not be a satis- 
factory procedure even if “on the average” it is very good simply because 
it is very good for most people. By directly quantifying considerations such 
as these in formal decision theoretic terms it would be possible to handle 
them within the classification-decision theoretic paradigm. It will be more 
natural, however, and more useful to treat these problems carefully but in
a less structured way within the dual guidance-selection paradigm. This
can be done by examining regressions within relevant subpopulations. 
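One way to examine regressions within subpopulations is sketched below, assuming a single predictor, a single criterion and ordinary least squares. The data, the subgroup labels and the function names are hypothetical; the point is only that a pooled line can conceal quite different lines within recognizable subgroups.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def within_group_lines(records):
    """Fit a separate regression line for each subgroup label."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in records:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: fit_line(xs, ys) for group, (xs, ys) in grouped.items()}


# Invented (subgroup, test score, criterion) records.
records = [
    ("A", 1, 2.0), ("A", 2, 3.0), ("A", 3, 4.0),
    ("B", 1, 1.0), ("B", 2, 1.5), ("B", 3, 2.0),
]
pooled = fit_line([x for _, x, _ in records], [y for _, _, y in records])
lines = within_group_lines(records)
print("pooled:", pooled)
print("within groups:", lines)
```

With these invented figures the pooled line overpredicts for every member of subgroup B and underpredicts for every member of subgroup A, which is the kind of systematic error the separate regressions are meant to expose.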


It is now generally recognized that the maximization of performance 
on any one particular criterion is seldom the only consideration relevant to 
a guidance or selection decision. It may well be that a particular student 
would have a higher grade-point-average in business school than in law 
school, but if he strongly prefers law to business and if he can be assured 
that he can “make it” through law school, this may well be the best choice 
for him and for society. That is to say that his degree of career satisfaction 
and his overall long-term contribution to society may be greater if he 
attends law school. In cases such as these the important contribution of a 
prediction technology will be to assure that his initial choice is a reason- 
able one. 


goals in their student body, understanding that such diversity and balance 
create a richer university experience for all of their students (e.g., see
Whitla, 1968). Often American colleges accept students from underde- 
veloped areas, both domestic and foreign, not because they necessarily 
believe that these students will be “better” than others that are turned 
down but simply because society, at large, has a greater need to train 
these students. The pertinent question in relation to these students is not 
how well they will do in any absolute or even relative sense but whether 
they will profit sufficiently from the program. Operationally this often 
reduces to simply trying to predict whether or not these students will be 
able to complete the training program satisfactorily, even at the most 
minimal level. 

This humanistic tradition (Katz, 1966) also takes as a basic precept 
the notion that an individual will not necessarily be most happy doing 
the kind of work for which his aptitudes best qualify him. The fact that 
a high school senior is the best typist in his class and only the second best 
mathematician should not automatically suggest advanced training at a 
secretarial school. Most people will be well qualified to pursue more than 
one vocation successfully. The scientific approach to personnel guidance 
views the task of prediction as one of informing the individual as to the 
extent of his probable “success” in those training courses and vocations that 
interest him. The humanistic tradition allows that the choice, whether of 
college or vocation, be left to the individual, to the extent that that choice 
does not make unacceptable demands on society. 

This humanistic tradition also takes as a precept the belief that neither 
the mechanical efficiency of society nor its gross material output is the sole
or even the primary goal that personnel technology should serve. If it is
agreed that the function of society is to serve all of its people, then, right- 
fully, any maximization must be of the benefits to these people rather than 
to the structure of society. Very rigid manpower policies can guarantee 
having neatly ordered tables of organization, but orderly structure does 
not guarantee either work efficiency or career satisfaction. Democratic 
societies always seem less orderly than totalitarian ones, but this lack of
order has proven to be productive, satisfying, and ultimately more
[...].

One feature of the problem that becomes apparent immediately is that 
short term optimization is often at the expense of long term good. A mature 
college graduate will not necessarily select that graduate program in which 
he feels he can do best. Rather he will ask himself what kind of training 
he needs for the kind of work he wants to do. He may find that this train- 
ing can be obtained only at a university at which, and in a course in 
which, it is predicted that he will do at best moderately well. But if he 
can be satisfied that there is a sufficiently high probability of successfully 
completing that program, then he may find it to his advantage to forego 
the attainment of immediate honors from an easy program in favor of the 
long-term benefits from the more difficult program. 

More broadly, it is now recognized that decisions must always reflect the 
desires and aspirations of the individual and the needs of society as they
represent the combined aspirations of its people. Much work has been
done to develop formal mathematical systems that incorporate both probabilities
and utilities. A knowledge of these methods is very useful (see
the treatment of the application of decision theoretic models
in Cronbach and Gleser’s Psychological Tests and Personnel Decisions).

With presently developed technology it does not seem to be feasible
to handle the quantification of utilities as part of the kind of centralized
comparative prediction service that we shall be discussing in this paper.
When meaningful and accurate, formal quantification can be very useful.
But a strained, inaccurate quantification would be mechanistic and stifling.
The assessment of utilities should at present be left to the student and his
guidance counselor. What can now be done is to provide a well explicated
probabilistic system which the student, guidance counselor and admissions
officer can use as a concrete basis for their own relatively informal
utility analyses. No doubt an expansion in guidance services should be
accompanied by increased training in utility analysis for guidance counselors.
This position is entirely consistent with that of Cronbach and
Gleser who in the second edition of their book remarked (p. vii) that
“work since 1955 has reinforced our judgment that decision theory is
more important as a point of view than as a source of formal mathematical
techniques for developing and applying tests.” Nevertheless, the development
of simple guides to precise and meaningful utility analysis for career
guidance is certainly a research area that should now be receiving high 
priority. 

The possibility of turning important personnel decisions over to a 
computer has been tempting. For a time this approach exuded an aura 
of relevance, objectivity and precision, the very qualities that justify 
educational testing. But contemporary youth demand a greater personal 
participation in the determination of their future. They now rebel against 
any vestige of authoritarianism in education, even when it is one more of 
form than of substance. In this atmosphere the dehumanizing effect of 
unmoderated computer-made classifications will be enormously costly. 
There is a limit to the benefit that can be obtained from investment in 
computer hardware. More efficient computers may, in fact, be needed for 
work in personnel guidance, but the need for more and more thoroughly 
trained and equipped guidance personnel and for more relevant and ac- 
ceptable quantitative tools is far greater. Improved computer facilities 
which students can manage directly in an interactive mode, however, may 
prove useful in relieving the counselor of some of the burdens of informa- 
tion storage and retrieval. But above all else the goal must be to maximize 
the informed participation of the student in the determination of his future. 


Stages in the Development of Educational Testing Technology 


The earliest successful work in educational testing was of a mani- 
festly empirical character. By designing tests having direct relevance to the 
operational task given him, Binet (see Chauncey & Dobbin, 1963) was 
successful in discriminating between those French schoolchildren who were 
or were not able to benefit from the particular school program available 
to them. 


With the resources, technology and personnel available during the 
latter part of the nineteenth century it was not possible to develop a multi- 
plex of testing procedures each tailor-made for a particular action decision. 
Partly for this reason Binet’s methods did not enjoy wide application in 
Europe; but with little delay these methods crossed the Atlantic and, 
particularly with the beginning of World War I, found rich soil in which 
to grow. 


An important theoretical step was taken when Spearman, in England, 
proposed a single ability factor theory to account for the relationships 
among test scores, and between them and academic success. According to 
Spearman, each student could be thought of as having a unidimensional 
ability which accounted, in large measure, for his performance on various 
tests and on various academic tasks. Each of these tests and tasks had its
own specific component, but the various tests and tasks shared only a
single general dominantly important factor. Such a theory suggested that 
the purpose of testing was to measure this general factor, called intelligence, 
and to rank individuals so that the more able students could be identified. 
While it was acknowledged that specific tests and specific tasks might have 
specific features, these were considered to be relatively unimportant. 
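The single factor position described above is commonly written as follows. The notation is an assumption introduced here and is not taken from Spearman or from the works cited: $x_{ij}$ is the standardized score of student $i$ on test or task $j$, $g_i$ is his general ability, $a_j$ is the weight of the general factor in test $j$, and $s_{ij}$ is the component specific to that test.

```latex
x_{ij} = a_j \, g_i + s_{ij}
```

If the specific components are taken to be uncorrelated with $g$ and with one another, the correlation between any two tests $j$ and $k$ reduces to $r_{jk} = a_j a_k$, so that the general factor alone accounts for the relationships among the tests.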


Spearman’s theory supported the development of the IQ test which, 
for a time, was the major component of educational testing. Undoubtedly 
this single factor or single trait approach to testing enjoyed popularity
because it justified a basically unidimensional approach to educational 
testing, and such an approach was perhaps all that the resources, tech- 
nology and personnel of its day could support at the operational level. By 
reorienting testing from the task-oriented prediction paradigm, which available
resources and technology could not support, to the universal ability-oriented
concept of measurement, which resources and technology could
support, psychologists made possible a dynamic and immensely useful
growth in testing.


The unifactor theory did not long hold preeminence. The theoretical 
simplicity and practical utility which buttressed it soon gave way before 
the onslaught of Thurstone’s succession of studies showing that human 
ability is not unidimensional and hence simple, but multidimensional and 
hence complex. Thurstone demonstrated conclusively that it was useful 
to isolate many human ability factors and that persons’ rankings on these 
factors could vary substantially. Technologically this meant that psychologists
should construct multiscale tests and that, in specific applications,
weighted composite scores should be used based on just those scales rele- 
vant to the short and long run implications of the intended decision. 


Thurstone’s theoretical position triumphed not only because it was
superior to Spearman’s, but also because some increase in available resources,
and the consequent technological breakthroughs, made it possible
in testing practice to partially reflect his ideas. However, partly because
of cost, many major educational testing programs, as contrasted
with [illegible], have adopted a compromise between the
Spearman and Thurstone positions. It has been found that a very
workable procedure is to measure and report measures on two omnibus
dimensions of human ability labelled verbal ability and quantitative ability.
These measures have the crucial advantages of simplicity and understandability,
the absence of which had limited the use of more complex
[illegible]. Until recently these measures constituted the core
of [illegible] testing programs. In addition to their reporting
[illegible] and their relevance to immediate academic decisions,
these measures have proved popular because they are believed to measure
a broad spectrum of abilities relevant not only to immediate prediction in
the academic situation but also to the more long term and important ques- 
tions of future job success. In Cronbach and Gleser’s (1965) terminology 
these tests have a wide bandwidth. These tests will not and should not be 
abandoned in the very near future, but they must be modified, extended 
and supported, as they now are beginning to be in a few testing programs, 
by newer and more immediately relevant tools. 

In recent years the technology of testing has been developing rapidly 
and the sophistication of persons in the field has also been rising, slowly 
at first, but now with greater acceleration. Coupled with this has been a 
dynamic development in the computer systems available both to test pub- 
lishers and test consumers. The task facing educational testers has also 
changed radically. At one time the objective of testing was to aid in the 
selection of the few. Now, in the decade of the seventies and beyond, the 
objective will be the guidance of the many. As a result of these advances 
educational testing now stands poised for major developments in programs 
and related services which may well have great significance for American 
education. 

In discussing the College Board Program, Turnbull (1968) identified 
possible future stages in the development of testing programs. The first of 
these is of concern to us here because the methods surveyed in this paper 
are directly and immediately relevant to it. This next stage, as Turnbull sees 
it, is the stage of multiplex external programs, which involves “an exten- 
sion of the recent trend toward the diversity of testing programs and of 
tests within programs.” 

This stage involves a giant step in the “tailoring” of testing to meet 
the demands for decision making relevant to specific examinees and specific 
choices. The trend here is away from, or towards a supplementation of, 
the omnibus testing which has served as a workable compromise between 
the Spearman and Thurstone approaches, and directly toward a Thur- 
stonian recognition of the multidimensional complexity of human abilities 
and the multidimensional requirements for effective personnel decision 
making. Operationally this involves constructing tests of narrowed bandwidth
on the theory that their fidelity, i.e., predictive power for specific
decisions, may be increased. Turnbull, however, suggested that there is a 
“missing element,” “a way to express the results of both standardized 
tests and school performance in terms meaningful to post-secondary edu- 
cation, in a language at least as well understandable . . . as the College 
Board scale.” 

The methods described here adopt a sophisticated improvement of a 
long available reporting language—one that is much more understandable 
than any test score. These methods provide, for each student and for each 
college or program in which he is interested, understandable, meaningful 
and maximally accurate predictions of his potential performance in educational
opportunities that are relevant to his goals. A reporting system
with this objective has been in operation in the state of Washington since 
1960 as part of Horst’s Washington Pre-College Testing Program. Other 
statewide systems have also been in operation, e.g., Georgia (Hills, 1964, 
1966; Hills & Klock, 1965; Hills, Klock & Bush, 1964, 1965), Utah (Jex, 
1966), Minnesota (Johnson, Swanson, Joselyn & Berdie, 1961) and Iowa. 
Several testing organizations have also recently taken important steps in 
this direction. Some giant evolutionary steps must now be taken in the 
further development of such programs. These steps should lead to mean- 
ingful improvements in an existing system. 


For one thing, predictions for college applicants should often be made 
in two forms. The student should be given a point and/or interval esti- 
mate of his future grade point average both for his first year in college 
and (at that point or later) for the entire four year program. He should 
also be given both point and interval estimates of the probability of his 
completing both the first year and (at that point or later) the entire four 
year program. These certainly are understandable quantities, and of im- 
mense immediate relevance to his problem of selecting a college or other fur- 
ther education program. Emphasis on the second kind of prediction is not 
found in current practice. 


Students need much more information about college curricula than 
they are now getting and some training and guidance in decision making 
would be useful. A thorough discussion of these latter problems is given 
by Katz (1963, 1969a, 1969b). It would also be useful if students were 
informed of the probability that they would be accepted by each of the 
colleges to which they might wish to apply. All of this requires immense 
computer storage and computation speed, but the needs are not beyond 
present day capabilities. 


Thus, after more than 80 years, and only after major breakthroughs
in [illegible] theory, testing technology and computer resources, is it now
possible to use on a broad scale the multiplex, direct task-oriented system,
validated empirically by Binet, rationalized theoretically by Thurstone
and advocated for so many years by [illegible].


Despite our previous discussion, readers may still feel
that this approach sacrifices educational meaningfulness to attain statistical
efficiency by focusing attention primarily on narrow criteria such as
grade point average (GPA). It is important that an answer be given immediately
to that thoughtful query. That answer is based on an understanding
of the nature of the decision problem for which educational tests
and the statistical prediction methods accompanying them are used.


Historically, selection methods at the university level have focused on 
the prediction of first year GPA. The major virtue of GPA is its availability,
though there have been studies showing some relationship between
first year performance and subsequent academic performance and indeed 
later career performance. These latter correlations, however, are by no 
means impressive. For a brief summary of these results see Holland and 
Richards (1965). For a more detailed treatment see Hoyt (1965, 1968). 
Ideally, guidance services should report predictions of occupational per- 
formance to the extent possible. 

College grades have many weaknesses and much work needs to be 
done to improve grading standards. They suffer from many of the same 
weaknesses common to letters of recommendation and other subjective eval- 
uations, as well as from overdependence on a student’s verbal skills. In 
college it is important to be able to discuss learned material intelligently. 
In later life it is important to put that material to practical use. We now 
realize that it is absurd to grade automobile mechanics with an essay exam- 
ination, but how often do we ask a student in a class in learning theory 
to actually do something with his newly acquired knowledge? If we did, 
perhaps his course grades would better predict his vocational success. 

The prediction of GPA is undoubtedly tied up with a proper emphasis 
in universities on academic excellence, the desire of individual teachers to 
instruct good students and the desire of institutions to produce scholars. 
In part this is a carry-over from an earlier age when the pursuit of learn- 
ing was considered to be its own reward. Such an attitude remains reflected 
in university policy because to discard it completely would destroy our 
universities as we now know them and particularly their essential roles in 
basic research and human enlightenment. However, our leading univer- 
sities have shown that it is possible to maintain academic excellence and 
extensive programs of basic research and at the same time serve the larger 
needs of society. Thus, for example, not every graduate student doing work 
in mathematics is now pointed toward a career in basic research and teach- 
ing in mathematics. Rather, a large percentage of students taking such 
courses are doing so only to acquire needed skills for technological appli- 
cation. This has been true for many years now, but only in recent years 
have educators been willing to speak directly in these terms. 

It would be an unwarranted digression to explore here all of the 
complicated ramifications of that development. But the demands that this 
thinking places on score reporting-prediction procedures are relevant. One 
clear requirement is the reporting of an estimate of the probability of 
success in the particular university or particular course of study for each 
student. A student may not wish to attend the university at which he 
would do best, and a university may not necessarily wish to take only 
those students who will do best. Rather, each may seek a matching of stu- 
dent to program so as to offer the prospect of the student making a 
significant contribution to society, provided the student has a reasonable
probability of completing the course. Thus the guidance and acceptance of 
a student will depend upon the relevance of a particular program to the 
abilities and goals of the individual student and those of the individual 
institution. The Bayesian methods discussed in this paper are oriented 
towards this approach to score reporting and prediction. The guidance and 
selection models currently available must be extended to provide interval 
estimates both of GPA and of successful completion of the program of study. 

It should also be pointed out that at most rungs of the educational 
ladder selection typically has a dual purpose. There is first the task of 
selecting students who will successfully complete that particular program, 
but there is also the task of feeding into the system a pool of students who 
will eventually be qualified for consideration for selection at the next level 
of training. Thus, for example, selection at the undergraduate level must 
bear in mind the needs of society for qualified entrants into graduate school. 
For this purpose a prediction of GPA is more relevant than a pass-fail 
prediction. For this reason it again seems useful to report predictions in 
more than one form. 

There are, of course, more standard methods of central prediction (for 
example, see Tucker, 1963) and the reader undoubtedly now wonders why 
it is necessary to have a new statistical methodology. The problem arises, 
in part, because of the present dynamic nature of American education. 
Previously, curricula within colleges remained relatively unchanged for 
many years, and colleges themselves changed their natures even more 
slowly. Therefore data could be collected over a period of several years 
with the assurance that regression equations determined from the data 
would be applicable and useful for another several years. Thus the size 
of the sample available for any study was limited mainly by administra- 
tive difficulties in gathering data. 


Many American colleges no longer always exhibit such stability. Programs
within colleges can change dramatically in just a year or two
and thus historical data may now have only descriptive value. Much 
stability may still be found in the Ivy League universities, the large mid- 
western state universities and, in fact, in most of the more highly visible 
institutions. But the major growth in educational opportunities during the 
1970’s will be at the community college level, and here stability has not 
been common. Even in the more stable institutions some programs undergo 
substantial change in a very short period of time. We can also expect 
graduate education to continue to undergo periodic changes. Thus, for 
example, if a graduate psychology department were to increase substantially 
the mathematical content of its curriculum, the use of regression equations 
from previous years would be very unsatisfactory. Since criterion data are 
available only after several years of acceptances have already been made, 
there is a clear demand for some way of writing prediction equations that
takes account of whatever small amount of relevant data there is with 
respect to that college and also the experience that other similar colleges 
have had in such a situation. If a particular college can identify its new 
program as being similar to that of certain other colleges, it would un- 
doubtedly want to draw on the experience of these colleges. This would 
also be true of a college undertaking prediction problems for the first time. 

The Bayesian methods discussed here have a virtue peculiar to them. 
These methods make it possible to increase the accuracy of predictions for 
the individual not only by gathering additional data about him and the 
college to which he is applying, but also by gathering additional data about 
the group of which he is a member and about colleges similar to the ones 
to which he is applying. It is a significant virtue of the Bayesian method 
that our knowledge concerning groups of students and of colleges gives us, 
probabilistically, information that can be translated into more accurate 
predictions for each individual. When a great deal of data is available 
about a particular college it turns out that the Bayesian method yields 
precisely the same result as the classical method. The statistical basis for 
this will be discussed in the next section. 

Thus the Bayesian regression approach to be presented here, with its 
increased sensitivity and potentially with dual predictive modes, seems to 
be the natural approach to the guidance-selection problem and, as we 
have seen, the mode of score reporting is both easily understood and 
maximally informative. Some details of this proposed reporting system are 
given in later sections. 

The step now being taken in the evolution of educational testing 
technology is one that is firmly grounded in a succession of historical de- 
velopments. This step is in no sense revolutionary, and while there have 
been many important contributions it is not the child of any single person. 
Nevertheless, if this next step is taken specifically in the direction suggested 
in this paper, it will be a giant step. It will involve the embracing of 
statistical methodology that is only slowly losing its controversial status. 
It will also mean that though some of its techniques will remain im- 
portant, useful, even essential, the entire measurement tradition will lose 
its primacy as a basis for developing operational testing procedures. But 
here again we do no more than echo the prescription contained in Cron- 
bach and Gleser (1965) to abandon the view expressed by Hull (1928) 
that “the ultimate purpose of using aptitude tests is to estimate or forecast 
aptitudes from test scores.” Surely it must be recognized that relevance in 
testing cannot be inferred solely from the estimation of true score. 


Bayesian Methods in Educational Testing 
Reviews of Bayesian methods have recently been given by Meyer
(1966, 1969). We shall now describe Bayesian analyses for two important
new models. This presentation is meant to provide a technical basis for 
an improved guidance-selection system. A parallel verbal presentation 
of this material is given at the beginning of the section on An overview 
of new developments in testing services. Many readers may prefer to see 
this verbal summary before examining the explicit quantitative statement of 
the models. 


The first Bayesian analysis we shall describe is that of the classical 
test theory model and the second is a regression model for several sub- 
groups. The first of these will be familiar to measurement specialists who 
should see from this discussion the intimate relationship between test 
theory and Bayesian inference. The second model provides results which 
are similar to those of the first model and which are directly relevant to 
comparative guidance services. Finally we survey Bayesian methods in the 
analysis of variance components. 


Within the classical test theory model each person’s observed score
$x$ on a test may be used as an estimate of his true score $\tau$. If this is done,
the standard deviation of the errors over persons for such a procedure
(the standard error of measurement) will be $\sigma_X (1 - \rho_{XX'})^{1/2}$, where
$\sigma_X$ is the observed score standard deviation and $\rho_{XX'}$ the reliability of the
test. This provides a measure of the inaccuracy, on the average, of estimating
the true score from the observed score $x$. An alternative method
of estimation is to use the weighted average regression estimate
$\rho_{XX'}\,x + (1 - \rho_{XX'})\,\mu_X$, where $\mu_X$ is the mean of the observed scores in the
population of persons. If this is done, the population standard deviation, over
persons, of the resulting errors (the standard error of estimation) is
$\sigma_X\,\rho_{XX'}^{1/2}\,(1 - \rho_{XX'})^{1/2}$.

As is to be seen on comparing formulas, the standard error of estimation
is always less than the standard error of measurement, and substantially
so when the reliability of the test is not large. Thus, by incorporating
the known values of the reliability and the mean observed score into
the estimate of true score by means of the weighted average regression
estimate, a better estimate is obtained than that based solely on the
observed score.

Apart from its mathematical derivation the regression formula makes
intuitive sense. If one has little or no information about a person, and
can assume that he has in effect been randomly selected from the
population at hand, it seems reasonable to use the mean ability level
in the population as the estimate for that person. When the test is very
short the reliability of the test will be near zero and the regression estimate
will be very nearly equal to the population mean score. When the test
is very long the reliability of the test will be
near unity and the regression estimate for each person will be nearly equal
to that person’s mean observed score. Thus, as one would expect, when 
very reliable information is available about a person’s true score one 
would need to put little weight on the mean population value. 
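
The comparison between the two standard errors can be checked numerically. The following is a minimal sketch in Python, not part of the original presentation; the function names and the score values (observed score 130, population mean 100, standard deviation 15, reliability 0.80) are hypothetical and chosen only for illustration.

```python
import math


def regression_estimate(x, reliability, pop_mean):
    """Weighted average regression estimate of true score."""
    return reliability * x + (1.0 - reliability) * pop_mean


def standard_error_of_measurement(sd_x, reliability):
    """Error s.d. when the observed score itself is used as the estimate."""
    return sd_x * math.sqrt(1.0 - reliability)


def standard_error_of_estimation(sd_x, reliability):
    """Error s.d. when the regression estimate is used instead."""
    return sd_x * math.sqrt(reliability * (1.0 - reliability))


# Hypothetical values, assumed for illustration only.
observed, mean_x, sd_x, rel = 130.0, 100.0, 15.0, 0.80

estimate = regression_estimate(observed, rel, mean_x)
sem = standard_error_of_measurement(sd_x, rel)
see = standard_error_of_estimation(sd_x, rel)

print(f"regression estimate:           {estimate:.1f}")
print(f"standard error of measurement: {sem:.2f}")
print(f"standard error of estimation:  {see:.2f}")
```

The estimate is pulled from the observed score toward the population mean, and the standard error of estimation is the smaller of the two, the gap widening as the reliability falls.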

Unfortunately, there is a difficulty in attempting to apply the Kelley 
formulation in most practical applications because the population mean is 
typically not known before measurements are taken and hence the regres- 
sion formula cannot be used in its given form. In effect, what is needed is 
a regression estimate based not on the person’s observed score and a known 
mean observed score, but rather one based on the person’s observed score 
and the average observed score of a random sample of people from the 
population. Results of this type are available in the framework of Baye- 
sian methods with normality assumptions. The first of these was given by 
Lindley (see the discussion in Stein, 1962); later a fuller development was 
given by Box and Tiao (1968) and by Lindley (see Novick, 1969a). These 
estimates are of the form $w_{XX'}\,x + (1 - w_{XX'})\,\bar{x}$, i.e., a weighted average
depending on weights $w_{XX'}$, the person’s observed score $x$ and the mean
observed score $\bar{x}$ in the sample. In this formulation the quantity $w_{XX'}$ is an
estimate of the reliability of the observed score. It tends to unity as the 
number of observations on the person increases without limit and to zero 
as the number of observations on the person tends to zero. For intermediate 
cases its value depends on the relative number of observations on the 
particular person, the number of observations on all persons and also on 
the number of persons on whom observations are available. 

For our purposes it will be useful to consider Bayesian estimates 
obtained by Lindley since this method easily generalizes to the case of 
unequal replications. Under moderate conditions and using the specific 
prior distribution suggested by Novick (1969a) to characterize a situation
in which we have no prior information, Lindley shows that the mode
of the conditional distribution (the posterior Bayes distribution) of the
true scores $\tau_s$ after obtaining all observed scores can be calculated as the
solution of the m equations

\[
\frac{mn\,(x_{s\cdot} - \tau_s)}{\sum_i s_i^2 + \sum_i (x_{i\cdot} - \tau_i)^2}
= \frac{(m - 1)(\tau_s - \tau_\cdot)}{\sum_i (\tau_i - \tau_\cdot)^2},
\qquad s = 1, 2, \ldots, m, \tag*{[1]}
\]

where $x_{ij}$ is the j-th observation on the i-th person, m is the number of
persons, n is the number of replications on each person,
$s_i^2 = \sum_j (x_{ij} - x_{i\cdot})^2 / n$, $x_{i\cdot} = \sum_j x_{ij}/n$ and
$\tau_\cdot = \sum_i \tau_i / m$, and where it is assumed that
the $\tau_i$ are not all equal. These are the simplest of the Lindley equations.
Because the quantity $\tau_s$ is a part of the mean value $\tau_\cdot$, these equations
cannot be solved directly. An approximate solution to these equations
for large m and n having the general form described above is

\[
\hat\tau_s = \frac{\hat\sigma_T^2}{\hat\sigma_T^2 + \frac{1}{n}\hat\sigma_E^2}\, x_{s\cdot}
+ \frac{\frac{1}{n}\hat\sigma_E^2}{\hat\sigma_T^2 + \frac{1}{n}\hat\sigma_E^2}\, x_{\cdot\cdot},
\qquad s = 1, 2, \ldots, m, \tag*{[2]}
\]

where the $\hat\sigma$’s are the usual ANOVA estimates and $x_{\cdot\cdot} = m^{-1} \sum_i x_{i\cdot}$. For
small samples, equations [1] and [2] do not give the same results. Further
details of a method for obtaining the exact solutions to the Lindley equa- 
tions by iteration, the modest conditions under which they are valid and 
reasons for preferring the Lindley method are given by Novick, Jackson and 
Thayer (1970). A generalization of these equations to include the case 
of unequal replication numbers and including technical improvement to 
guarantee convergence was provided by Lindley (1969b). 
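
The approximate solution [2] is simple enough to compute directly. The sketch below is an illustration only and is not the iterative procedure of Novick, Jackson and Thayer (1970). It assumes equal replication numbers, takes "the usual ANOVA estimates" to be the within-persons mean square for the error variance and (between-persons mean square minus within-persons mean square)/n for the true score variance, and truncates a negative variance estimate at zero. The function name and the data are hypothetical.

```python
import numpy as np


def approximate_true_scores(scores):
    """Approximate true score estimates in the form of equation [2].

    scores: array of shape (m, n), n parallel measurements on m persons.
    Returns (estimates, weight), where weight multiplies each person's mean.
    """
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape
    person_means = scores.mean(axis=1)
    grand_mean = person_means.mean()

    # Within-persons and between-persons sums of squares.
    s_within = ((scores - person_means[:, None]) ** 2).sum()
    s_between = n * ((person_means - grand_mean) ** 2).sum()

    ms_within = s_within / (m * (n - 1))
    ms_between = s_between / (m - 1)

    var_error = ms_within
    # Truncation at zero is an assumption of this sketch.
    var_true = max((ms_between - ms_within) / n, 0.0)

    denom = var_true + var_error / n
    weight = var_true / denom if denom > 0 else 0.0
    estimates = weight * person_means + (1.0 - weight) * grand_mean
    return estimates, weight


# Hypothetical data: 4 persons, 2 parallel measurements each.
data = [[10, 12], [14, 16], [20, 22], [30, 32]]
estimates, weight = approximate_true_scores(data)
print("weight on person mean:", round(weight, 4))
print("estimates:", np.round(estimates, 3))
```

Each estimate lies between the person's own mean and the overall mean, and the weight on the person's mean approaches unity as the between-persons mean square grows relative to the within-persons mean square.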

The true score estimates given in [1] were obtained from a Bayesian
structural model which assumes that the observed scores for each individual
are normally distributed with mean equal to that person’s true score and
with homogeneous error variance $\sigma_E^2$ and that the true scores are normally
distributed with mean $\mu_T$ and variance $\sigma_T^2$. It was further assumed
that there was no information available about the true or error score variances
or the mean true score. Formally this was accomplished by using
the indifference prior distributions for $\mu_T$, $\sigma_T^2$ and $\sigma_E^2$ suggested by Novick
(1969a) as developed from the work of Novick and Hall (1965). These
indifference priors consist of independent uniform distributions on $\mu_T$,
$\log \sigma_T$ and $\log \sigma_E$. However, if some prior information is available
about the distribution of either true or error scores, this information can be
incorporated into the prior distribution using the procedure suggested by
Novick (1969a) as developed from the work of Novick and Grizzle (1965).
Often it is useful and sometimes it may be essential to do this. However,
it seems to be true that when the number of persons being tested is large,
prior information can be largely disregarded (Novick, Jackson & Thayer,
1970).

The choice of the prior distribution for this analysis reflects prior
knowledge and beliefs (or lack of them) concerning the mean true score,
the spread of true score values and the average variability of error scores.
These, in total, imply a prior distribution on the individual
true scores. After obtaining observations on persons we have a new
Bayes distribution for the $\tau_i$ and we also have a new Bayes distribution
for the mean true score, the variance of the true scores, and the variance
of the error scores, and all of this information is available to guide any 
decision that must be made at any stage of testing. Lindley’s methods and 
the very similar ones of Box and Tiao provide improved techniques for 
estimating true and error score variances and reliability. The details are 
given in a paper by Novick, Jackson and Thayer (1970). 

The point to be emphasized here is that at any point in the data
gathering the Bayes distribution for any particular $\tau_i$ reflects more than
just the observations on person i. Rather it reflects the combined information
relevant to all of the $\tau_i$. Thus after one obtains information on
some $\tau_i$ he is no longer completely uninformed about a new $\tau_s$; rather the
prior distribution for this new $\tau_s$ would effectively be the estimated distribution
of $\tau$ values in the population of people. As has been seen, the effect
of this is to regress estimates of true score towards a common mean. This
regression provides the Bayesian solution to a number of statistical problems.
Thus for this rather complex Bayesian structural model the actual
use of a vague prior for data analysis seems appropriate when the number
of persons is large, but for less complex models objections can be raised
(e.g., Novick & Grizzle, 1965). This is so because the buildup of information
is much more rapid with the structural model than with simpler models.

The Bayesian structural model has been applied by Lindley (1969a, 
1969c) to the estimation of regression coefficients. Suppose that a number 
of similar graduate departments of psychology wish to use the GRE ad- 
vanced psychology examination to predict a student’s performance on a 
final written examination and hence to supply one useful piece of information
for their selection process. Then a student with a score of $x$ has
an expected score of $y$ in the i-th department given by the linear model

\[
E(y_i \mid x) = \alpha_i + \beta_i x
\]

where the parameters $\alpha_i$ and $\beta_i$ depend on the particular psychology
department and typically vary among departments. Thus there are possibly
different linear regressions in each department. We assume the
distribution of $y_i$ given $x_i$ to be normal with mean as given above and,
for present expository purposes, with known residual variance $\sigma_i^2$.

Whatever experimental data are available and deemed relevant can
be expressed in the form

\[
y_{ij} = \alpha_i + \beta_i x_{ij} + \epsilon_{ij}
\]

for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n_i$.

Here it is supposed that there are m departments, that data are available
for $n_i$ previous students in each of these departments, that $x_{ij}$ is the test
score of the j-th student in the i-th department, that $y_{ij}$ is similarly his
final written examination score and that the residuals $\epsilon_{ij}$ are independently
normally distributed with mean zero and known variance $\sigma_i^2$.

The usual statistical analysis of the data would proceed in the fol- 
lowing way. Each department would be considered separately and the 
regression line for that department estimated by the usual least squares 
procedure giving estimates 


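\[
\hat\beta_i = S(x_i, y_i) / S^2(x_i) \tag*{[3]}
\]
\[
\hat\alpha_i = y_{i\cdot} - \hat\beta_i x_{i\cdot} \tag*{[4]}
\]

where

\[
S(x_i, y_i) = \sum_{j=1}^{n_i} (y_{ij} - y_{i\cdot})(x_{ij} - x_{i\cdot}),
\]
\[
S^2(x_i) = \sum_{j=1}^{n_i} (x_{ij} - x_{i\cdot})^2,
\]
\[
x_{i\cdot} = \sum_{j=1}^{n_i} x_{ij} / n_i
\]

and

\[
y_{i\cdot} = \sum_{j=1}^{n_i} y_{ij} / n_i,
\]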


these being the usual sum of products, sum of squares and means for the 
i-th department. 
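
As a concrete illustration of equations [3] and [4], the following minimal Python sketch computes the separate least squares line for each department. The function name, the department labels and the scores are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def least_squares_line(x, y):
    """Return (alpha_hat, beta_hat) for one department, per [3] and [4]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_xy = ((y - y_bar) * (x - x_bar)).sum()   # sum of products
    s_xx = ((x - x_bar) ** 2).sum()            # sum of squares
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical test scores (x) and final examination scores (y).
departments = {
    "A": ([500, 600, 700], [60, 70, 80]),
    "B": ([450, 550, 650, 750], [50, 58, 71, 77]),
}

lines = {name: least_squares_line(x, y) for name, (x, y) in departments.items()}
for name, (alpha_hat, beta_hat) in lines.items():
    print(f"department {name}: intercept {alpha_hat:.3f}, slope {beta_hat:.4f}")
```

Each department is treated in isolation here, which is exactly the feature of the standard procedure that the Bayesian argument sets out to improve.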


Lindley (1969a) has pointed out that this standard procedure is open to
criticism:

In estimating the regression for any [department] it fails to take
into account experiences gained with similar [departments]. For
example, suppose one found that the regression slopes, $\beta_i$, were
typically around one, then one would expect the slope for [another
department] to have about the same value and would be
astonished if it differed sharply from it. This is perhaps most
clearly seen by considering what one might intuitively do if no
data were available for one [department] beyond an x-score for
a single student; one would reasonably estimate his y-score as
$\alpha + \beta x$ where $\alpha$ and $\beta$ were some sorts of means of the values
obtained for similar schools for which there were data. This
criticism (of the classical method) can be overcome, and the suggested
procedure just described made precise, by using a Bayesian
argument in the place of the standard one.


The dissatisfaction with the orthodox approach stems from the fact
that one knows a priori that the slopes $\beta_i$ have similar values and that
departments are not terribly heterogeneous; similar remarks apply to the
ordinates $\alpha_i$. One therefore supposes that this prior knowledge is made
precise by assuming that the individual $\beta_i$’s are independent and identically
distributed with a normal distribution of unknown mean $\beta$ and known
variance; that the $\alpha_i$ come from a similar distribution with known variance
but unknown mean $\alpha$, and that these are independent. Furthermore the
knowledge of $\alpha$ and $\beta$ is supposed vague.

With these additional assumptions the full model is that

\[
E(y_{ij}) = \alpha_i + \beta_i x_{ij} \tag*{[5]}
\]

with variances $\sigma_i^2$, all the $y$’s being independent normal; that

\[
E(\alpha_i) = \alpha, \qquad E(\beta_i) = \beta \tag*{[6]}
\]

with known variances, all $\alpha_i$ and $\beta_i$ being independent normal; and that
the prior knowledge of $\alpha$ and $\beta$ is vague. Lindley then has shown how
for this model the posterior distributions of the $\alpha$’s and $\beta$’s may be found.
In particular he has shown how to obtain the modes of the posterior distributions
given the data, these modes providing modified versions of the estimates in
equations [3] and [4]. The expression for the Bayesian estimate of $\beta_i$
has the form

\[
\tilde\beta_i = \frac{S(x_i, y_i) + c_i}{S^2(x_i) + d_i} \tag*{[7]}
\]

where the precise nature of $c_i$ and $d_i$ is not of concern here. Lindley
(1969a) has further pointed out,


The terms $c_i$ and $d_i$ represent corrections to be applied to the sums
of products and squares respectively in the light of the “prior”
information we have about the parameters. Without $c_i$ and $d_i$, the
right hand side of [7] is the usual estimate $\hat\beta_i$, equation [3]. Furthermore
$c_i$ and $d_i$ depend not just on the pooled data for the
i-th department but also on pooled data from the other departments.
In particular they depend on all the other estimates
$\hat\beta_j$, $j \neq i$, and have the effect of regressing the estimates from
$\hat\beta_i$ to $\tilde\beta_i$, where $\tilde\beta_i$ is nearer to the average slope than is $\hat\beta_i$.
Hence the extreme slopes are modified by being moved towards the
central values. The formula is even valid for a department about
which there are no data; the estimated slope takes the form $c_i/d_i$,
though, in this case the ratio $c_i/d_i$ does not depend on i.
This form essentially says that we can estimate the slope for this
department by regarding it as typical of the other departments for
which data are available. Similar methods and results apply to the
estimation of the intercepts $\alpha_i$.


When the variances $\sigma_i^2$ are unknown the exact Bayesian structural
estimates of the $\alpha_i$ and $\beta_i$ must be obtained as a solution to a set of Lindley
equations (Lindley, 1969c) similar to, but more complex than, those given 
in [1]. An initial tryout of this method (Jackson, Novick & Thayer, 1970) 
seems to have provided reasonable results. An extension of these methods to 
the multiple predictor case has recently been given by Lindley (1970). 

Setting aside the mathematical notation, the Bayesian solution to the
m-group regression problem can, with only slight loss of accuracy, be described
in very simple terms. Effectively what the Bayesian does is to
compute the least squares regression line in each of the m groups and also
for the pooled data; he then takes
each group and computes a Bayesian regression line as a weighted average
of the individual least squares line and the line for the pooled data in a
manner analogous to that of obtaining the regression estimate of true score.
This line can be shown to provide better predictions on the average than 
the least squares line. 
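
The weighted-average idea can be illustrated with a minimal sketch. It uses the form of equation [7], but the correction terms here are simplifying assumptions rather than Lindley's actual expressions: `prior_weight` is a hypothetical constant standing in for $d_i$, and $c_i$ is taken to be `prior_weight` times the unweighted average of the least squares slopes. The data are made up.

```python
# Sketch of a slope regressed toward the average slope across groups.
# prior_weight is a hypothetical stand-in for d_i in equation [7];
# c_i is assumed to equal prior_weight * average_slope.

def sums_of_products_and_squares(xs, ys):
    """Return S(xy) and S^2(x), both taken about the group means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy, sxx


def shrunken_slope(sxy, sxx, average_slope, prior_weight):
    """Compute (S(xy) + c) / (S^2(x) + d) with d = prior_weight, c = d * average_slope."""
    c = prior_weight * average_slope
    d = prior_weight
    return (sxy + c) / (sxx + d)


# Made-up data for three departments: (predictor values, criterion values).
groups = {
    "A": ([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0]),
    "B": ([1, 2, 3, 4], [1.0, 1.5, 2.0, 2.5]),
    "C": ([1, 2, 3], [1.0, 2.0, 3.0]),
}

stats = {name: sums_of_products_and_squares(xs, ys) for name, (xs, ys) in groups.items()}
least_squares = {name: sxy / sxx for name, (sxy, sxx) in stats.items()}
average_slope = sum(least_squares.values()) / len(least_squares)

PRIOR_WEIGHT = 5.0  # hypothetical; larger values pull harder toward the average
bayes_like = {
    name: shrunken_slope(sxy, sxx, average_slope, PRIOR_WEIGHT)
    for name, (sxy, sxx) in stats.items()
}

# A department with no data has S(xy) = S^2(x) = 0, so its slope is c / d.
no_data_slope = shrunken_slope(0.0, 0.0, average_slope, PRIOR_WEIGHT)

for name in groups:
    print(name, round(least_squares[name], 3), round(bayes_like[name], 3))
print("no data:", round(no_data_slope, 3))
```

Under these assumptions each modified slope lies between the group's own least squares slope and the average slope, and the department without data receives the average slope itself.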


Bayesian Analysis of Variance 


The analysis of variance components is a frequently occurring statis-
tical problem whose solution leads to unexpected complications. A familiar
example in educational testing is the estimation of the true score variance
$\sigma_T^2$ and the error score variance $\sigma_E^2$ in the classical test theory model, the
data being n parallel measurements on a sample of m persons.
The distributions of $\sigma_T^2$, $\sigma_E^2$ and the reliability coefficient could
have been obtained as a byproduct of the analyses described at the be-
ginning of the section, and an advantage of the Bayesian methodology is
that there can be no inconsistency between the conclusions reached about
the parameters of interest, as there might be if each were estimated
separately by some classical method which appeared “good” for it alone.
However, considerable light has been shed on the variance components
problem by a number of Bayesian investigations in recent years. 

The information about $\sigma_T^2$ and $\sigma_E^2$ provided by the data is sum-
marized by the pair of sufficient statistics


$$S_W = \sum_i \sum_j (x_{ij} - x_{i.})^2, \qquad S_B = n \sum_i (x_{i.} - x_{..})^2,$$

where $x_{ij}$ is the j-th parallel measurement on the i-th person,

$x_{i.} = \sum_j x_{ij}/n$ is the i-th person’s average score,

$x_{..} = \sum_i x_{i.}/m$ is the overall average score.


$S_W$ and $S_B$ are commonly referred to as the within-persons and between-
persons sums of squares, and have associated with them $m(n - 1)$ and
$(m - 1)$ degrees of freedom respectively. Dividing the sums of squares
by their degrees of freedom we obtain the mean squares $MS_W$ and $MS_B$ with
expectations

$$E(MS_W) = \sigma_E^2$$

$$E(MS_B) = \sigma_E^2 + n\sigma_T^2.$$

Usual classical practice is to take $MS_W$ as an unbiased estimate $\hat{\sigma}_E^2$ of $\sigma_E^2$
and $n^{-1}(MS_B - MS_W)$ as an unbiased estimate $\hat{\sigma}_T^2$ of $\sigma_T^2$. Clearly
$\hat{\sigma}_T^2$ can be negative, which is felt to be somewhat absurd, and many
modifications of classical methods have been proposed to deal with this
situation, none of them entirely satisfactory.

Bayesian methods always lead to a nonnegative estimate of $\sigma_T^2$.
Also a number of writers, using Bayesian methods, have brought into clearer
focus the implications of a classical estimate $\hat{\sigma}_T^2$ which is substantially
less than zero. This work graphically illustrates the fact that such a result 
casts grave doubts on the assumptions of the model, particularly the 
assumptions of parallelism and experimental independence of the replicate 
measurements. This work also highlights the weakness of the classical 
estimate of error variance in that it fails to use the information in the 
between sum of squares. A survey of the technical details of this work has 
been given by Novick, Jackson and Thayer (1970). 
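
The classical computation described above is easy to trace numerically. The following sketch computes the within-persons and between-persons sums of squares, the mean squares, and the two classical unbiased estimates. The function name and the two small data sets are illustrative assumptions; the first data set is contrived so that the classical estimate of true score variance comes out negative.

```python
def classical_variance_components(scores):
    """scores[i][j] is the j-th parallel measurement on the i-th person."""
    m = len(scores)        # number of persons
    n = len(scores[0])     # number of parallel measurements per person
    person_means = [sum(row) / n for row in scores]
    grand_mean = sum(person_means) / m

    # Within-persons and between-persons sums of squares.
    s_w = sum((x - pm) ** 2 for row, pm in zip(scores, person_means) for x in row)
    s_b = n * sum((pm - grand_mean) ** 2 for pm in person_means)

    # Mean squares on m(n - 1) and (m - 1) degrees of freedom.
    ms_w = s_w / (m * (n - 1))
    ms_b = s_b / (m - 1)

    return {
        "S_W": s_w,
        "S_B": s_b,
        "MS_W": ms_w,
        "MS_B": ms_b,
        "error_variance": ms_w,               # unbiased estimate of error variance
        "true_variance": (ms_b - ms_w) / n,   # unbiased, but may be negative
    }


# Contrived data: replicate measurements disagree, person averages are identical.
inconsistent = classical_variance_components([[1, 5], [5, 1], [3, 3]])

# Contrived data: persons differ clearly, replicate measurements agree closely.
well_behaved = classical_variance_components([[10, 12], [20, 22], [30, 28]])

print(inconsistent["true_variance"], well_behaved["true_variance"])
```

In the first data set the replicate measurements vary more within persons than the person averages vary between persons, which is the kind of pattern that casts doubt on the parallelism and independence assumptions.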


Multiple Comparisons and the Choice of Predictor Variables 


Two problems that have been of intense and continuing interest to 
data analysts in education, psychology and other behavioral sciences and 
which are important in the development of a Bayesian guidance technology 
are those of multiple comparisons and the choice of predictor variables. 
For each of these the Bayesian position seems so sound, even compelling,
when viewed in the context of the models discussed earlier that some com-
ments on these topics seem appropriate here. A more detailed treatment
of these problems is given by Novick (1969b) on the basis of work by
Lindley (1965).

The multiple comparisons problem arises when one attempts, say, to 
simultaneously make individual comparisons of the differences among a 


lar objections arise when less trivial formal procedures are used. Of course, 
there are “experiment-wise,” as opposed to “pair-wise,” methods of treating 


It is a 
but they need not be. As Novick and Grizzle (1965) indicated, when 
uniform prior distributions are used a priori on mean parameters, then 
the Bayesian method yields results very similar to the classical pair-wise


prior distributions; they demonstrated that his construction yields more
satisfactory results than those obtained from the classical pair-wise approach, 


If one views each “treatment” (i.e., mean) effect as an effect ran-
domly selected from a population of such effects, then any decision he


However, when the structural model is used the Bayesian method will
regress observed


ments, and this regression will diminish w. 
kon in any application, that he should 
ria it is probably in part because he does not “expect” to find many
differences to be small. To give his procedure the desirable amount of
conservatism he need only quantify those prior beliefs that justify this desire. 
If he is inaccurate in his assessment of the parameters of the distribution 
of treatment means he will at least know that on the average his posterior 
beliefs concerning these parameters will be more accurate than his prior 
assessments, as well as his decisions based on them. Actually, for reasons 
already given, the Bayesian structural model even without presumed prior 
information avoids the problems encountered with uniform prior distribu-
tions on individual parameters.

The problem of the choice of predictor variables is a variant on the 
theme of multiple comparisons, attention being shifted from a considera- 
tion of treatments having nonzero difference from the mean of the treat- 
ment effects, to those variables having nonzero partial correlation with some 
criterion. It is well known that in a classical approach, when sample sizes 
are small relative to the number of predictors, it is often best in a pre- 
dictive efficiency sense not to use all variables in the multiple regression, 
but rather to use some lesser number. In the Bayesian approach, unless one 
or more of the partial correlations is zero, it is always better to use all 
variables when the evaluation of the efficiency of the procedure is made 
with reference to the Bayes criterion and with the assumed prior distribu- 
tion. These apparently contradictory conclusions can be reconciled. Since 
a Bayesian analysis with uniform prior distributions used for all regression 
parameters is essentially equivalent to a classical analysis, it should not 
surprise us that the use of such unreasonable prior Bayes distributions leads 
to unreasonable frequency results. However, suppose that one assumes that 
the regression coefficients have been sampled from a population of regres- 
sion coefficients, and suppose that the prior distribution on the mean of 
these coefficients is centered at zero and the prior distribution on the 
variance does not place undue weight on infinite values. What we will then 
find is that our posterior estimates of the individual regression coefficients 
will themselves be regressed to the mean of the regression coefficients so 
that for small samples the Bayesian will have quite different estimates 
than will the classicist. Again, use of the Bayesian structural model will 
accomplish this end. For small sample sizes some of these regression co- 
efficient estimates may in fact regress very nearly to zero. As in the previous 
estimation problem, both Bayesians and non-Bayesians use prior informa- 
tion, but only the Bayesian explicitly quantifies this aspect of his work and 
only the Bayesian method permits the data to modify these prior beliefs. 
Perhaps this “explains” why the Bayesian uses all variables when a prior 
distribution is available. 
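
The regression of coefficient estimates toward zero can be sketched for two predictors. The sketch assumes a normal prior centered at zero with a known ratio of error variance to prior variance, here the hypothetical parameter `shrinkage`; under that assumption the posterior mean solves the least squares normal equations with `shrinkage` added to the diagonal. The structural model in the text estimates the prior variance from the data, which this sketch does not attempt. The data are made up.

```python
def centered(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def two_predictor_coefficients(x1, x2, y, shrinkage=0.0):
    """Solve the 2x2 normal equations with `shrinkage` added to the diagonal.

    shrinkage = 0 gives ordinary least squares; shrinkage > 0 corresponds to
    an assumed zero-centered normal prior on both coefficients.
    """
    u, v, w = centered(x1), centered(x2), centered(y)
    a11 = sum(a * a for a in u) + shrinkage
    a22 = sum(b * b for b in v) + shrinkage
    a12 = sum(a * b for a, b in zip(u, v))
    b1 = sum(a * c for a, c in zip(u, w))
    b2 = sum(b * c for b, c in zip(v, w))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Made-up small sample: five cases, two correlated predictors.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

classical = two_predictor_coefficients(x1, x2, y)
regressed = two_predictor_coefficients(x1, x2, y, shrinkage=5.0)
nearly_zero = two_predictor_coefficients(x1, x2, y, shrinkage=1e9)

print(classical, regressed, nearly_zero)
```

With a small sample the coefficient vector is pulled noticeably toward zero, while every variable remains in the equation; as the sums of squares grow with more data, a fixed `shrinkage` matters less and the estimates approach the classical ones.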


An Overview of New Developments in Testing Services 


The Bayesian regression model developed by Lindley is based on the
simple notion that the ability of persons and the grading and performance
standards of educational institutions can be more efficiently estimated by 
taking into account our knowledge that particular values of these parameters 
will be highly related within homogeneous groups. The first application was 
to the estimation of true scores. Here a Bayesian justification was found 
for Kelley’s classical weighted-average regression-estimate of true score 
based on observed score and mean observed score in the population. The 
Kelley estimate in its Bayesian extension was found to be a weighted average 
of the observed score for the person and the mean observed score in the 
sample of persons tested with the weights being, respectively, the sample 
reliability of the test and one minus that reliability. In effect this estima- 
tion procedure is based on the notion that when the observed score is 
relatively unreliable an improvement in estimation can be obtained by 
using information about the mean value in the sample of examinees. The 
Bayesian estimate regresses the individual observed scores back toward the 
mean of all of the observed scores and this regression is large when the un- 
reliability of the test is large. 
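
The weighted average just described can be written out directly. In this minimal sketch the function name and the numbers are illustrative assumptions; the weights are the test reliability and one minus that reliability, as stated above.

```python
def kelley_estimate(observed_score, mean_observed_score, reliability):
    """Weighted average of a person's observed score and the sample mean.

    The observed score gets weight `reliability`; the mean observed score
    in the sample gets weight 1 - reliability.
    """
    return reliability * observed_score + (1.0 - reliability) * mean_observed_score


# Hypothetical examinee scoring 600 in a sample whose mean observed score is 500.
highly_reliable = kelley_estimate(600.0, 500.0, reliability=0.9)
less_reliable = kelley_estimate(600.0, 500.0, reliability=0.5)

print(highly_reliable, less_reliable)
```

The less reliable the test, the further the estimate moves from the observed score toward the sample mean.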


This same idea was then used to provide a model for new guidance 
and selection methods for situations in which an external criterion is 
available. The standard situation is that one has information, in the form 
of test score and/or high school grades, about students who come from 
different schools and who express an interest in different colleges. Just as 
students exhibit different true ability levels on a particular test or school 
performance records, so schools and colleges have different grading stan- 
dards and hence differing difficulty levels which must be taken into account 
if accurate and unbiased prediction is to be accomplished. When one wishes 
to estimate the parameter values relevant to schools or colleges, it will be 
helpful to use expert judgment to group schools and colleges homogeneously 
and to use a Bayesian regression type estimate for each school and each 
college parameter with each parameter value being regressed back towards 
the average value of such parameters for all schools or colleges in that 
group. (Such groupings can, of course, be modified subsequently.)
One application was discussed briefly to illustrate the
use of the formal model. We now describe other applications of this same
model. Our purpose is to indicate how the use of this model can and
should (or should not) substantially affect testing practice in the very
near future. We shall be providing a brief prospectus of academic

One modification of test score reporting that might come to mind
would involve the reporting of Bayesian regression estimates of true score
instead of the actual observed scores. We might be tempted, for example,
to identify each examinee as a student in a particular high school and
then regress his observed test score toward the mean of the observed
scores obtained by persons from that school. This procedure, however, can
be faulted on several counts. The reliabilities of most academic aptitude 
tests are very high in populations in which there is a broad range of ability 
levels, and even in restricted subpopulations they do not typically differ 
substantially from one group to another. When reliabilities are high the 
regression effect is so small that there is little point in regressing the 
estimates. Moreover, even if the reliabilities were not large the regression 
estimates would not substantially change the ordering of the examinees 
unless there was a substantial difference in the test reliabilities in the dif- 
ferent groups or the number of replications across persons varied. Since, 
again, these differences tend in practice to be small, there is little point 
in making these corrections. 

Furthermore, if the reliabilities were not large and if the mean value 
in the groups were more than trivially different, serious objections con- 
cerning the fairness of this procedure would need to be raised. As Robbins 
(1960) has pointed out, it is unfair to penalize a student by lowering our 
estimate of his ability because we know that he can be identified with a 
low ability group. Thus there is good reason to question the fairness of 
rejecting this student and accepting another student who did less well on 
the test simply because we identify the second student as coming from a 
high ability group. There is both common sense and theoretical justifica- 
tion for thinking that a student with an SAT Verbal score of 600 coming 
from a very poor school is, in fact, a better choice for many colleges 
than a second student with a score of 610 coming from a very good school. 
Thus the reporting of Bayesian regression estimates of true score both 
lacks significant virtue and has possibly serious defects. A resolution of the 
fairness question can be obtained only when the relevance of the test score 
to the pending action decision is taken into account. (See Cleary, 1968, 
for an intelligent discussion of the problem of bias in testing.) 
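
Both points can be checked with a small numerical sketch. All figures are hypothetical, and the reliability is set deliberately low so that the effect is visible. Within a single school, regressing toward the school mean leaves the ordering of examinees unchanged; across schools with different means, it can reverse the ordering of two students, which is the situation the fairness objection addresses.

```python
def regressed_score(observed_score, school_mean, reliability):
    """Regress an observed score toward the mean of the examinee's school."""
    return reliability * observed_score + (1.0 - reliability) * school_mean


RELIABILITY = 0.8  # hypothetical and unrealistically low for an aptitude test

# Two students from the same school (mean 500): the ordering is preserved.
same_school = [regressed_score(s, 500.0, RELIABILITY) for s in (600.0, 590.0)]

# Student A scores 600 at a school with a low mean;
# student B scores 590 at a school with a high mean.
student_a = regressed_score(600.0, 450.0, RELIABILITY)
student_b = regressed_score(590.0, 650.0, RELIABILITY)

print(same_school, student_a, student_b)
```

Student B, who did less well on the test, ends up with the higher reported estimate solely because of school membership.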

Testing organizations will undoubtedly be most interested in using 
test scores to provide comparative prediction of success at various colleges 
and within various curricula within a college. In this application the 
regression of grade point average on test score for each college and each 
college program is estimated and this is done without taking into account 
the high school affiliation of each student. 

The purpose of such an exercise is to arrive rationally at the kind of 
judgments now often being made on the basis of rather poorly gathered, 
poorly organized and poorly transmitted information on the overall diffi- 
culty level of various colleges or college programs and the particular traits 
necessary for success at these colleges and in these programs. The exercise, 
however, is not a simple one and requires a careful, precise definition of the
problem. Horst and his associates have been doing successful work along
these lines at the University of Washington for many years. The new
Bayesian methodology, however, promises a substantial increase in the
accuracy with which such predictions can be made. 


One precisely stated problem would be the prediction of first-year 
GPA’s at various colleges on the basis of academic aptitude test scores. 
By limiting the problem to the prediction of first-year grades, difficulties 
arising from differences in standards among departments within a college 
are minimized. At many colleges all freshmen take very much the same 
program so that there will be no interdepartmental differences. However, an 
attempt to predict four-year GPA must generally be done within depart- 
ments. Hoyt, at the conclusion of an extensive and careful study of fore- 
casting academic success in specific colleges (Hoyt, 1968), concluded: 
“Especially in complex institutions, a single prediction of academic success 
may be unsatisfactory since it ignores differences among curricula.” At 
the graduate school level it would undoubtedly be necessary to work field 
by field rather than across fields within school. The Bayesian method 
would be especially useful in this application because of the relative small- 
ness of individual programs. 


Guidance and selection problems for professional schools are particu- 
larly suited for treatment by centralized prediction methods. The relative 
smallness of the programs and the greater community of interest among 
the participants would make these programs ideal field laboratories during 
the developmental stage of a Bayesian guidance-selection project. Ultimate- 
ly, comparative prediction should be a continuing process which begins in 
the earliest years of education in the assessment of reading readiness and 
continues throughout a person’s active work years. 


Consider another rather simple guidance-selection problem: a single
college selecting students from a fixed group of high schools. Suppose
that studies have not been done relating high school performance to college 
performance but scores on a battery of tests are available on all students. 
A standard approach would be to relate such test scores to college perform- 
ance and to make tentative selections on the basis of the particular com- 
bination of test scores which appears to best predict college performance. 
This classic criterion-oriented approach to combining a number of test 
scores will generally prove superior to any single scale unidimensional latent
trait approach which attempts simply to order students on the basis of their
“intelligence,” presumably so that the more “intelligent” can be selected
without regard to the peculiar character of the individual college. The
Bayesian regression approach promises even further significant improve-
ment.

In such situations the schools will often differ substantially in the
average scores of their students and typically this will be concomitant
with the general level of instruction and grading within the schools. Schools
that get good students can teach more than schools that do not and in
turn these schools can produce better trained students who will score more 
highly on tests than students from other schools. However, present level 
of training, in itself, is not necessarily an adequate predictor either of 
performance at the next level of training or in career potential. 

Again, if one believes in the possibility that a 600 student from a poor
school may be a better potential selection than a 610 student from a good
school, one needs some formal mechanism for evaluating this hypothesis.
A Bayesian differential predictability regression model provides the needed
tool. If the model described in the previous section is used, the slope and
intercept of the regression of college performance on test scores are com-
puted for each school as an estimate that is regressed from the usual least
squares value back toward the average value among all schools. This accom- 
plishes two things. First it improves the overall accuracy of the estimation 
procedure, often even providing reasonably good estimates when only 
small amounts of data are available on some schools. Second, it permits 
differences in regression slopes and intercepts to emerge in a continuous 
fashion from the data as the amount of data increases. In this application 
such differences will not often exist, but when they do they will be very 
important. Recall that when no information is available about a par- 
ticular school the Bayesian regression estimate is the average value among 
schools, and when “infinite” information is available the Bayesian estimate 
is the usual least squares estimate. 

Now it may happen that clear differences among schools in slopes 
and intercepts emerge from a particular data set. For example the slope 
of the regression line for one “good” school may be less than the slope for 
a “poor” school. If it is less and if the crossing of these regression lines 
occurred in the region of obtainable scores, then for high scores the pre- 
dicted value for the student from the poorer school would be higher than 
that for the student from the better school. The opposite could, of course, 
be true. In any event this is something about which accurate information 
rather than speculation must be made available. If it were found that 
prediction was substantially poorer in some groups than in the overall 
population, there would be good reason both to examine the aptness of 
the training program and the criterion for this subgroup and to seek more 
effective predictors for this subgroup. More generally, differential pre- 
dictability methods can be used whenever students can be grouped in a 
meaningful way so that different regression lines are relevant in different 
groupings. 
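
The crossing of regression lines can be made concrete with a short sketch. The intercepts and slopes below are hypothetical figures for a “good” and a “poor” school, chosen so that the lines cross inside the range of obtainable scores; they are not estimates from any study.

```python
def predicted_gpa(intercept, slope, test_score):
    """Predicted grade point average from a school-specific regression line."""
    return intercept + slope * test_score


def crossing_score(intercept_1, slope_1, intercept_2, slope_2):
    """Test score at which two regression lines give the same prediction."""
    if slope_1 == slope_2:
        return None  # parallel lines never cross
    return (intercept_2 - intercept_1) / (slope_1 - slope_2)


# Hypothetical lines: the "good" school has the smaller slope.
GOOD_SCHOOL = (1.5, 0.002)   # (intercept, slope per test score point)
POOR_SCHOOL = (0.6, 0.004)

cross = crossing_score(*GOOD_SCHOOL, *POOR_SCHOOL)

# Predictions for a low-scoring and a high-scoring student from each school.
low_good = predicted_gpa(*GOOD_SCHOOL, 400)
low_poor = predicted_gpa(*POOR_SCHOOL, 400)
high_good = predicted_gpa(*GOOD_SCHOOL, 600)
high_poor = predicted_gpa(*POOR_SCHOOL, 600)

print(cross, low_good, low_poor, high_good, high_poor)
```

Below the crossing score the student from the good school has the higher prediction; above it the student from the poor school does.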

Thus we have presented another simple situation in which the Bayesian 
regression model might make a substantial contribution to predictive effi- 
ciency and in this application it is difficult to anticipate any charge of 
unfairness against the method. The method is fair because it accurately
performs its function of predicting a person’s ability to succeed in a
program of study. It would be unfair to mislead a person by know-
ingly furnishing him with overpredictions of his probability of success in
a given program. Under certain circumstances, however, it might be argued 
that the accepting systems should only be furnished with predictions based 
on data from the entire population. Unfairness in its most objectionable 
form arises when the range of available programs is restricted so as to 
exclude from training some who can profit from further formal education 
and when selection for available programs is based on measurements that 
are not relevant to the prediction of success in that program or in the 
career opportunities to which this program leads. Just as tests must be 
tailored for different kinds of decisions so educational opportunities must 
be tailored to different kinds of people (Cronbach & Snow, 1969). One 
important outcome of differential predictability studies should be an in- 
crease in the variety of programs available to students. 


If the Bayesian comparative prediction and differential predictability 
models both prove useful with test scores as predictors, there is the pos- 
sibility of combining the two systems. This involves the use of a more 
complicated mathematical model than the one described above for use in 
comparative prediction or differential predictability alone. 


Another potential application of the Bayesian structural model is to 
the adjustment of high school grades to provide optimum prediction of 
college grades for various curricula and the adjustment of college grades 
to reflect intercollege differences in grading standards. The first major 
study involving such central prediction methods was that of Bloom and 
Peters (1961). This work has led to the belief that substantial increases 
in correlation can be effected by adjusting high school and college grades. 
Some later studies conducted by testing organizations (e.g., Lindquist,
1963; Watkins & Levine, 1969) have failed to support the early promise 
of such methods. A review by Linn (1966) does not provide a favorable 
appraisal of such methods. However a very recent study by Cory (1968) 
does support one non-Bayesian method. 


It should be evident that a necessary condition for any adjustment
method to be of value is that there exist school differences, college
differences, and possibly a large interaction effect (i.e., differing slopes within
school-college pairs). If both schools and colleges exhibit negligible be-
tween-group differences, it should hardly be surprising when no benefit is obtained
from adjustment methods. In such situations research suggests that the
high school grade average is the best single predictor of college
grades, better, typically, than academic aptitude test scores. When
differences do exist among high schools, the use of test scores rather than
grades may largely do away with the need for student source adjustments.
Indeed this simplification has been a major reason for the existence of a
testing industry. 


college grading standards and differences in within-pair regression slopes do
exist, classical methods can be less than useful when many parameters must 


College in Spearfish, South Dakota, but it is unlikely that there will be 
much data on students from a specific high school attending a specific 
college. 

This lack of data causes no insurmountable problem when the Baye- 
sian structural model is used because information on similar school-college 
combinations can be used to provide statistically optimum prediction 
weights even when no information is available on a particular school- 
college combination. Until such time as the Bayesian structural model has 
been used in situations in which school and college differences do exist, it 
would be premature to presume final judgment on grade adjustment meth- 
ods, particularly since they may be specifically the tools needed for what 
Turnbull (1968) calls the school-based system (see also Turnbull, 1970). 
In some applications it may prove helpful to use both test scores and high 
school grades differentiating both as to high school source and to pros- 
pective college choice. Work done by French (1963) and by Lunneborg 
and Lunneborg (1966, 1967) suggests that the use of course grades and 
academic aptitude tests as predictors of college performance can profitably 
be supplemented by other cognitive and noncognitive measures. 

In summary, the Bayesian methodology can be used to provide direct 
predictions of success based on test scores and/or previous grades for one
or more possible training programs and treating the applicant group as a 
whole or dividing it into relevant subgroups when appropriate. 


A Score Reporting and Guidance Service 


There are several features of the proposed score reporting and guidance 
service that should make it attractive both to students and school adminis- 
trators. These features are illustrated in the following and in Table 1. 


Score Report
Academic Aptitude Test

                      AAT Verbal    AAT Quantitative
Scaled Score             500              600
Percentile Rank          50%              89%


Table 1
Guidance Information

                                                               If Accepted
                                        --------------------------------------------------------------
                                        Probability     First Year     Probability of     Four Year
                         Probability    of Completing   Predicted      Attaining Degree   Predicted
University               of Acceptance  First Year      Grade Point    in Normal Period   Grade Point
                                                        Average                           Average

Ivy League University                     .85-.95         2.4-2.7         .60-.80           2.0-2.8
Underdeveloped Area
  Technical College                       .90-.95         3.4-3.8         .15-.95           3.2-3.8
North Atlantic
  State University                        .95-1.00        3.5-3.9         .80-.95           3.3-3.9
Rocky Mountain
  State University                        .80-.90         3.0-3.5         .65-.85           2.8-3.5
Community Junior College                                  2.9-3.3

Hopefully this report would also contain AAT subtest scores for diagnostic 
purposes and, when appropriate, scores on scales relevant to vocational- 
technical curricula. In a guidance rather than an admissions application 
the score report would have a similar form except that the column labelled
University would be replaced by one labelled Program of Study or Training
Program. 

The low probability of acceptance for this student at Ivy League
University reflects primarily the low selection ratio at that university.
Note, however, that if accepted, his probability of successfully completing
the first year is higher than at Rocky Mountain State University or at
Community Junior College. This is not an unusual finding. Many highly
selective universities are very protective of those students who are accepted,
while many universities with open door policies let entering students fend
for themselves with the result that many marginal students fail for academic
or other reasons.

The negligible probability of the student’s acceptance at North Atlantic
State University and his certain acceptance at Community Junior College
reflect only the fact that North Atlantic State University accepts almost
no out-of-state students and that Community Junior College is required to
accept all residents with high school diplomas. Differences in predicted
GPAs largely reflect differences in curricula at the various universities and
varying degrees of emphasis on verbal and quantitative skills together with
the more obvious differences in the overall difficulty levels at the various
universities. Differences in lengths of prediction intervals largely reflect 
the amount of prior information on each of these universities. Note that 
the prediction intervals for the four year GPA are longer than for the 
first year GPA. To obtain more accurate four year predictions it may be 
necessary to differentiate among departments or major departmental group- 
ings (arts, sciences, business administration, etc.) since the requirements 
may differ among such groupings more than they do among universities. 
This could certainly be done more accurately after the student has com- 
pleted his first year of college and is beginning to select his major subject 
area. 

The preceding remarks make it clear that the accurate interpretation 
and successful use of the information contained in the guidance report will 
require precision and care. A heavy responsibility will fall both on those 
who prepare the explanatory materials that accompany the report and on 
the individual guidance counselor. It must be made clear to the student 
that the predictions made available and resulting from any particular 
pattern of test scores apply to the typical examinee receiving these scores. 
Another way of saying this is that the stated predictions will be good 
predictions for randomly sampled students from those attaining the par- 
ticular pattern of scores. These predictions, however, should be used only 
as a base line. The guidance counselor must emphasize that all predictions 
are based on data from groups in which there has been much self-selection 
on variables not measured in a validity study. Therefore, regression co- 
efficients determined by any method, Bayesian or other, must be treated 
cautiously for any particular applicant. The guidance counselor must bear 
the final responsibility for combining the information contained in this 
report with all other information on the student, taking into account any 
special knowledge that may be available on the student. He must be sure 
that these predictions do not push a student into a program that the 
student’s own self-understanding would indicate to be a poor choice. He 
must also look carefully for special qualities that a student may have that 
would make him particularly attractive to a college. In addition to this, 
the guidance counselor must be able to help the student understand his 
preferences and utilities and to combine these with the prediction so that 
the student can arrive at a rational decision. 

The score reporting form described above is unlikely to be appropriate 
in that precise form for any specific program, since it was designed to 
exhibit and emphasize the contributions of the Bayesian methodology. The 
precise composition of any reporting form will depend on the specific re-
quirements of an individual program. The thesis of this paper is that the
Bayesian prediction methodology will make it possible for such programs
to place heavier emphasis on prediction methods. The major advantages
of the Bayesian prediction methodology described here are an increased
accuracy of prediction when data are scarce and a resulting ability to discard
obsolete data and thus keep up with current trends. Because this essentially 
clerical function is done centrally, guidance personnel are forced to examine 
the individuality of their own problems more carefully and to devote an 
increased effort to the task of helping each student formulate and under- 
stand his own goals and find possible means of attaining these goals. 


Implications for Test Construction Methodology 


As we indicated earlier, it is common practice in some testing pro- 
grams to report only omnibus verbal and quantitative aptitude scores 
despite the fact that it has long been recognized that human ability is 
multidimensional rather than two-dimensional. In most programs the 
verbal and quantitative scores are composites based on several different 
components of human ability. Parenthetically we might say that if either 
of these scales were unifactorial, serious questions would need to be
raised as to their appropriateness. If we accept the fact that ability is
multidimensional, it would seem strange to use a two-dimensional academic
aptitude test.

Over the years the suggestion has been repeated many times that
multiple scale scores be reported on all academic aptitude tests. This would
mean that those charged with guidance and selection responsibilities could
combine these scores by using the particular regression weights appropriate
to the prediction problem that concerns them. There has been a great
reluctance on the part of the largest testing organization to do this. The
objection has been that because of the relative shortness of the subscales,
they would individually be relatively unreliable. Certainly they would
not attain the degree of reliability of the usual composite scores. The worry
has been that inexperienced test score interpreters would overinterpret and
overemphasize any peculiarities of any of the individual subtest scores.
This worry is a legitimate one.

It has not been possible to break this impasse although, in theory, 
the individual reliabilities of the subscales are unimportant provided that 
the composite used for prediction (whether that be the usual unit-weight 
composite or a multiple regression composite) is reliable. The proposed
reporting and guidance service resolves this problem by centrally
doing the multiple correlation work based on subtests so that only predicted
GPA’s and scaled total test scores need to be reported to the students. Each
of these quantities will typically have very high reliability.
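
The point that a composite can be highly reliable even when its subscales
are not can be checked numerically. The sketch below is illustrative only.
It uses the standard formula for the reliability of a weighted composite
under the assumption that measurement errors are uncorrelated across
subscales. The function name `composite_reliability`, the six subscales,
their reliabilities of 0.70, and their intercorrelations of 0.50 are all
hypothetical values chosen for the example, not figures from any testing
program.

```python
def composite_reliability(weights, sds, reliabilities, correlations):
    """Reliability of a weighted composite of subscales.

    Assumes measurement errors are uncorrelated across subscales.
    correlations[i][j] is the observed correlation between subscales
    i and j, with 1.0 on the diagonal.
    """
    k = len(weights)
    # Observed variance of the composite: sum over i, j of w_i w_j s_i s_j r_ij
    total_var = sum(
        weights[i] * weights[j] * sds[i] * sds[j] * correlations[i][j]
        for i in range(k)
        for j in range(k)
    )
    # Error variance of the composite: each subscale contributes
    # w_i^2 * s_i^2 * (1 - reliability_i)
    error_var = sum(
        (weights[i] * sds[i]) ** 2 * (1.0 - reliabilities[i])
        for i in range(k)
    )
    return 1.0 - error_var / total_var


# Hypothetical example: six short subscales, unit weights, unit variances.
n_scales = 6
example_corr = [
    [1.0 if i == j else 0.50 for j in range(n_scales)]
    for i in range(n_scales)
]
example_rel = composite_reliability(
    weights=[1.0] * n_scales,
    sds=[1.0] * n_scales,
    reliabilities=[0.70] * n_scales,
    correlations=example_corr,
)
print(round(example_rel, 3))
```

Under these assumed values the composite reliability is about 0.91,
noticeably higher than the 0.70 of any single subscale. The same function
accepts regression weights in place of unit weights.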

When prediction work is centralized, testing organizations will be 
encouraged to use a variety of item types in every verbal and quantitative 
scale. The particular choice of item types and the resulting composite can 
be made relevant to the particular uses to which the scale is going to be put.
Thus some tests could still ostensibly consist of omnibus verbal and quan- 
titative scales, but behind these would stand a varying multiplex of sub- 
scales selected and used through regression methods for the particular 
problems for which the test is being used. 

This does not mean that the idea of multiple score reporting need be 
abandoned. When it can be presumed that students will have adequate 
professional guidance in interpreting these scores, previous objections may 
be overcome. An important contribution of centralized prediction method- 
ology is that it lessens the necessity though not the desirability of multiple 
scale reporting. 

With the shift of emphasis from the estimation of ability to the pre- 
diction of performance there should be a simultaneous shift in test con- 
struction techniques. While such psychometric properties of tests as item 
difficulty and biserial correlation with total test score will remain impor- 
tant, they will need to be supplemented by questions of item-criterion and 
subscale-criterion correlation. The empirical validities of individual scales 
will become at least as important as their reliability. Subscale length, for 
example, would be manipulated by methods derived from Horst’s and 
Calvin Taylor’s work to maximize composite score validity (e.g., see
Woodbury & Novick, 1968; Jackson & Novick, 1967; Novick & Thayer, 1969a;
Thayer & Novick, 1969) rather than reliability. The inescapable fact is that
as essential as are considerations of the psychometric properties of tests
for the establishment of psychology as a quantitative rational science,
they are not sufficient for the assessment of “relevance in testing.”
MANPOWER PLANNING AND ITS ROLE
IN THE AGE OF AUTOMATION 


Richard H. P. Kraft 
The Florida State University


The recognition of the growing interdependence between
vocational training and higher technical education and industry is
a major feature of the educational history of our times. Modern industry 
relies upon a level of skill and competence which is supplied through
technical education at various levels; and educational institutions need the
active support of industry if they are to supply these levels of skill and
competence.
This interdependence is related to the research topic reviewed in this paper: 
namely, the kind of occupational training and technical education the 
American school system should supply, and the constant renewal and de- 
velopment of that education by changes in knowledge and in the manpower 
needs of industry. 

Given today’s relationship between manpower problems and
technological changes, it is rather alarming that technological changes
have not become an area of primary research concern within
the field of education, nor is there a body of research on the effect of
technological change on curriculum. Unfortunately, there seems to be
little agreement on the interpretation of the term “technological change.”

“Technological change” is defined here in its more technically precise 
form; it considers two dimensions of change: (a) the technical dimension, 
and (b) the economic-social dimension. 




what we can do. Research seeks out the practical and more or less prac- 
ticable. Technological change, however, reflects the actual adoption of 
new methods and products; it is the triumph of the new over the old 
in the test of the market and the budget.

“Technological change” as distinct from discovery is a complex economic
and social process, affected by a range of decisions by business
enterprises, labor organizations and workers, national and local govern- 
mental agencies, the educational system, households, and by the values 
and attitudes of the whole community. No single body makes a decision 
as to the rate of technological change in the society, no law can increase 
it by simple decree (Dunlap, 1962, p. 4). 
For many years educators have ignored technological changes in
higher technical education and vocationally oriented training; they
persisted in preparing students for a world viewed from an inherited, 
often locally-oriented outlook. Only recently have educators recognized 
the need for a positive attitude toward space-age technology; thus, con- 
structive ideas have been developed regarding the adjustment of vocational 
and technical curricula in order to prepare students for their future roles. 


Perhaps the most important theme of this paper is the sense of urgency 
concerning the measures and attitudes to be adopted by educators and 
administrators. The system of vocational training and higher technical 
education must be endowed with a capacity for change and innovation 
so it can adequately respond to the legitimate pressures and demands of 
modern society. 

Technological changes in the past few years have made the relation- 
ship of education to the American economy not only much closer than 
it was in earlier decades, but also more visibly related to the rate of 
economic growth and the life-time earnings of the labor force. One of 
the many aspects of the relation of the economy to the educational system 
lies in the connections between occupational structure and the size and 
character of vocational-technical education. As industry undergoes rapid 
changes in its occupational structure and as technological change and 
automation raise the skill level of jobs, the educational system must undergo 
a dynamic expansion. Obviously there are some connections between 
these broad developments. On theoretical grounds alone I am tempted 
to suggest that changes in the occupational structure of industry do have 
measurable effects on American technical education institutions because 
the new demand for educated personnel quickly transforms itself into 
higher enrollments. This development undoubtedly will be accompanied 
by higher costs, which is justifiable since many more people with highly 
developed skills and abilities will be needed and since the economy requires 
a work force which can adapt itself to constantly changing circumstances. 
As the economy requires a greater output of qualified manpower, it will be 
impossible to meet that requirement without having consequential changes 
and adaptions within the educational system. 


Automation and the Occupational Structure 


As far as the scope of this paper is concerned, it would be misleading 
to suggest that “neat” conclusions to critical issues will be developed 
which will improve the understanding of technological developments and 
their effect on the economic structure. 

However, the relationship between labor and technological changes 
should be of great concern to the educational decision-maker since he 
must understand the implications of curriculum revisions in the light of
technological changes and the far-reaching consequences of unemploy- 
ment. The introduction of new techniques of production eliminates some 
jobs (affecting labor requirements) and occupations (creating changes in 
skill levels). However, at the same time, new jobs and new occupations 
are being created. 

Current labor market data suggest that “there are basically no in- 
herent, long-term difficulties in the technological disemployment problem, 
provided responsible managements give warnings of employment changes 
or facilitate adjustments internally through retraining or transfer and 
provided a high level of aggregate effective demand is maintained by 
government through its fiscal and monetary policies [Gannon et al., 1967, 
p. A-146].” Thus, the economist with deep interest in education should 
be somewhat reassured that the most significant employment implication 
of automation is not mass unemployment but new areas of employment. 

Concerning the contribution of technological change to current or 
short-term instances of unemployment, the general level of unemployment 
needs to be distinguished from the displacement of particular workers at 
particular times and places. In a recent study Gannon (1967) wrote: 

Changes in the general level of unemployment are governed by
three fundamental forces: the effective growth of the labor force,
the increased labor productivity (i.e., output per man-hour) and
the growth of total or aggregate demand for goods and services.
The general level of aggregate demand for goods and services is
the prime factor in determining the general level of employment
and unemployment.

Technological change affects all three of these major forces, 
but its main effect is registered (incompletely) through the rise in 
productivity [pp. A-101-A-103]. 

The basic relationships involved are qualitatively illustrated in the
following formula:

$$g_o = (g_p + g_n - d_h) - g_u \tag{1}$$

where $g_n$ = effective percentage growth in the labor force,
$g_o$ = percentage growth in effective demand for output,
$g_p$ = percentage growth in average productivity,
$d_h$ = percentage decline in total hours worked per year, and
$g_u$ = percentage of growth in unemployment rate.


Gannon (1967) concluded: 


. . . only when total production, ($g_o$), grows faster than the rate
of labor force growth plus the rate of productivity increase, does
the employment rate rise ($g_u$ increases), and hence the
unemployment rate falls. For example, for the economy as a whole,
if the rate of growth of productivity is 3 per cent per year, the
labor force grows at 1.9 per cent per year, and average hours
worked per year decline at 0.4 per cent per year, then from
equation (1) above:

$$g_o = (3 + 1.9 - 0.4) - g_u$$

i.e.,

$$g_o = 4.5 - g_u \tag{2}$$


Equation (2) above simply tells us that total output (and the 
aggregate demand to buy it), must grow in excess of 4.5 per cent 
per year just to prevent unemployment from rising [pp. A-101- 
A-103]. 
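
The arithmetic in the quoted example can be reproduced with a short
sketch. The two function names below are hypothetical and introduced only
for illustration; the relationship they encode is the one quoted above
(output growth equals productivity growth plus labor force growth minus
the decline in hours, minus the growth in the unemployment rate), and the
input figures are those given in the quotation. The output growth rates
of 5.0 and 4.0 per cent in the last two lines are assumed values added to
show the sign of the result.

```python
def required_output_growth(productivity_growth, labor_force_growth,
                           hours_decline, unemployment_growth=0.0):
    """Output growth (per cent per year) implied by the quoted identity.

    With unemployment_growth left at 0.0, the result is the output growth
    needed just to keep the unemployment rate from rising.
    """
    return (productivity_growth + labor_force_growth
            - hours_decline) - unemployment_growth


def implied_unemployment_growth(output_growth, productivity_growth,
                                labor_force_growth, hours_decline):
    """Rearrangement of the same identity, solved for unemployment growth."""
    return (productivity_growth + labor_force_growth
            - hours_decline) - output_growth


# Figures from the quoted example: 3 per cent productivity growth,
# 1.9 per cent labor force growth, 0.4 per cent decline in hours.
break_even = required_output_growth(3.0, 1.9, 0.4)
print(round(break_even, 1))  # 4.5

# Assumed output growth rates, for illustration only.
print(round(implied_unemployment_growth(5.0, 3.0, 1.9, 0.4), 1))  # -0.5
print(round(implied_unemployment_growth(4.0, 3.0, 1.9, 0.4), 1))  # 0.5
```

Output growth above the 4.5 per cent break-even gives a negative value
(unemployment falling), and output growth below it gives a positive value
(unemployment rising).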


Focusing on automation and its effects on the occupational structure, 
one is forcefully reminded that a serious omission in the United States 
is that of government-sponsored research on predicting the future of 
machine counterparts as substitutes for human information-processing. 
Until recently, data on technological and economic availability of these 
counterparts has also been overlooked. Research in this direction will
provide the basis for building predictive instruments for future changes
in occupations and job contents. Crossman (1964) remarked that only
when a matrix of information processes and machine counterparts has
been developed can the forecasting of future changes in technology be
undertaken.

Studies of specific responses which technological processes at the 
various stages of automation require of skilled personnel may provide 
skill information that is needed. A cross-technology investigation of re- 
quired responses will permit the identification of broad skill categories 
which could be used for developing suitable guidelines for vocational
training and technical education (Davis, 1965). 


The Emerging Social Demand Concept 


The social demand approach dominates much, but not all, of the
current educational planning work in the United States. Four definitions
of social demand as abstracted from some of the recent writings in
the field are: (a) Social demand for education means the effective demand
for places in formal education. (b) Social demand for education is the
eminent need of the democratic society (present and future) for
improvement of human capacity by formal and nonformal education.
(c) Social demand for education is an expression of securing equal chances
for all individuals to get all the education they can absorb. Or, similarly,
(d) Social demand for education means the demand derived from the
principles of giving all individuals an equal opportunity to get all the
education they ask for (Edding & Naumann, 1968, pp. 130-160). The
usefulness of this approach for curriculum planning in vocational-technical 
schools, colleges, and universities is limited by the unknowns in the 
relationships between particular occupations and the education that they 
require. Changes in technological processes may require a change in 
the educational input for particular occupations, while changes in content 
and methods of education affect the educational input for relevant occu- 
pations. Factors such as the appeal which various curricula have upon 
students, e.g., preference for arts or science instead of engineering, necessi- 
tate a revision of forecasts, and the constraints in this sector may again 
lead to a revision of the curricula. 

In light of the persisting uncertainty inherent in educational planning, 
the only general conclusion that can be drawn from the social demand
approach is that an appeal must be made to all educational decision- 
makers to adapt the structure, methods, and content of technical education 
to the new situation of fluctuating labor market requirements. 

Many industrialists believe that the status of vocational-technical 
education is changing at present. Some firms are quick to see that the 
educator is a valuable ally, but the attitude of others remains more 
traditional. Although industry as a whole is recognizing more rapidly 
that the efficiency of production is merely the efficiency of the producers, 
there is still the fear that the processes of education may bring 
some undesirable by-products. Many industrialists remember that educa- 
tion has a strong literary tradition, and that while it has educated men 
for responsible administrative positions, many educational institutions 
have either actively despised the skill of the profit-oriented manager or 
deliberately kept themselves in ignorance of the market forces and of 
economic laws. 

No one can deny that there is a cleavage between the academic world 
and the vocational world. The cleavage is evident in the incompatibility 
of the intellectual and the trade-union factions of the political parties; 
it occurs in education itself in the contrast between the “university” and 
the “state college” and between the various post-high-school vocational- 
technical educational institutions and the system of part-time vocational- 
technical education. These are all examples of an antithesis between the 
learned and the laborers that strongly affects all human society. The deep 
gap between “vocational” and “academic” is by no means a figment of 
the imagination. It is real, and a great number of educators and educa- 
tional administrators are deeply concerned that it might be widening. 

How is the gap between the “vocational” and the “academic” signifi- 
cant in the training and education of skilled labor in a changing labor 
market? First, society must acknowledge that occupational training is a
respectable role for post-high-school institutions, such as junior colleges. 
Sometimes it seems as if certain segments of the American system of higher 
education price themselves out of the market by unduly emphasizing their 
academic programs. 

The higher order tasks in American society become increasingly 
complex with each passing year, and at the same time the lower order 
tasks are being relegated to machines. The vast array of middle-order 
tasks will soon furnish the livelihood for the majority of American citizens. 
The development of area vocational-technical centers and junior colleges
is dependent on how successfully they are able to solve the problems of 
education and training for these middle-level tasks. 

Somehow, the system of vocational training and higher technical 
education must provide a continuous educational spectrum to match the 
continuous occupational spectrum. For example, in many engineering 
colleges a trend has developed in which extreme specialization is avoided 
since many of these colleges regard the vast spectrum of jobs at the 
technical level as consisting of clusters of jobs. Curricula in these institu- 
tions are usually planned for one or more of these clusters. Typical job 
fields or clusters are: civil technologies, mechanical technologies, electrical- 
electronics technologies, and industrial technologies. Also, in intermediate 
technical education, surveys have found that technical jobs range across a
wide spectrum; the extremes of the spectrum are (a) those jobs in which 
technicians work at quite highly sophisticated levels in research, and (b) 
those occupations that demand a great deal of manipulative skill and 
ingenuity with tools and equipment, but require only a modest background 
in science, mathematics, and engineering theory (see also Benson & Lohnes).

The important point of this finding is that there are all kinds of 
technical jobs between these extremes. The gap between the professions 
and skilled trades cannot be filled by one kind or level of qualified per- 
sonnel. It is at this point that many educational planners and junior 
college administrators in charge of curriculum commit a grave error. In 
their determination to be “academically respectable,” they plan programs 
only for engineering technicians and raise the academic level until it 
scarcely differs from that of an engineering program in a college of
engineering. Many administrators tend to defend this curriculum by
arguing that the public image of American technical education is one in
which occupational training hardly belongs to the educational world at all. 
It is seen, instead, as a minor ancillary of the world of industry. 

Three points on occupational and educational relationships should 
be stressed. First, if the educational planner-administrator wants to adjust 
the curricula in response to technological changes in planning strategies and
activities, he must not only throw new light on the efficiency of the
personnel policies of firms but also must take a comprehensive look at 
educational qualifications, the cost of education, and the problem of poor 
utilization of educated labor in various segments of industry. 


Second, educational planning, which involves the use of detailed 
occupational and educational data, must realistically review its outdated 
approach to rigid educational requirements for technical occupations. 
Research shows, for example, that in engineering jobs no single educational 
qualification or educational “avenue” stands out as the “optimum” educa- 
tion for that particular occupation. 


Finally, the administrator in charge of curriculum revisions must 
realize that firms invest in their educated labor in much the same way 
as their physical capital. Inquiries showed that large manufacturing firms, 
for instance, planned their use of highly qualified personnel over time in 
the same way as they planned their use of capital. These companies have 
recognized the utmost importance of predicting the rate of progress of 
automation and the accompanying changes in skill input. Within the 
framework of what sometimes is called “active labor planning,” these firms 
have already worked out plans to predict the employment at various skill 
levels that will be required in the future. 


Confronted with frequent, conflicting calculations about the future 
occupational structure of the labor force, the educational planner- 
administrator will have to solve the problem of translating the labor 
requirements by occupational categories into requirements by educational 
qualification. Undoubtedly, this constitutes a main difficulty since there 
seems to be little stable relationship between the occupation a person has 
and the schooling he has received. 

Davis is very concerned about solving this problem; he outlined 
some suggestions for the development of predictive instruments which 
might help the educational planner-administrator in initiating appropriate 
curricular changes. He separated short-term changes in occupations and
skills from long-term changes. To obtain the necessary data he proposed 
an intelligence network which would consist of “information links with a 
selected sample of representative employers, private employment agencies,
unions, and governmental agencies. This intelligence network would pro- 
vide reports about changes in selected jobs and their contents [Davis, 
1965, p. 8].”


Davis (1965) further proposed: 


. . . this network would permit the development of comprehensive 
information on changing occupational employment patterns in 
individual industries. Continued sampling of jobs and tasks 
selected on the basis of an automation taxonomy and subjected 
to study will permit the identification of changes in skill patterns 
within jobs. As a predictive instrument, the short-term indicators 
can be tested when complete and comprehensive data are available 
at longer time intervals [p. 9]. 


For the educational planner-administrator, long-term changes in occu- 
pations and skills are even more interesting. Davis (1965) pointed out 
that the “identification of long-term changes requires the development of 
predictive instruments having cross-technology capability and linking 
technology with economic feasibility. This would require us to begin with 
a... formulation of an automation taxonomy [p. 10].” 

In an earlier study by Davis a quite different approach was used. 
Age-earnings-education profiles were constructed showing that the rate of 
monetary return was higher at the technician level than at the engineering 
level. Even though some of the data are inadequate, it is tempting to 
conclude that the large earnings-differential might well lead to a higher 
demand for educational services at the intermediate (technician) level. 
In view of changing skill profiles, the need for a better differentiation 
between appropriate training functions should be emphasized. This seems 
to be an urgent requirement in order for educational services to meet 
industrial needs (Kraft, 1968b). 


Change Is Accepted Too Slowly 


The literature on the economics of education and technological change
is plentiful, but there is an unfortunate shortage of relevant empirical 
material. Thus, recent research into education and occupation (Kraft, 
1968a) had two aims: to stress data collection and, as a consequence of 
the empirical aspects of this research, to formulate new conceptual tools. 

During the interview phase of the industrial depth survey of this 
research project, officials of a representative number of firms reported that 
they believe technical curricula must reflect the most up-to-date knowledge 
in particular subjects and that continuous course revision is required to 
account for the increase in the amount of knowledge and the rapid change 
in its nature. At the same time, there is a limit to the amount of material
which can be accommodated within courses. The extension of technical 
schooling has resulted from the awareness that man in modern society
needs more basic knowledge but that extension in itself cannot fulfill 
man’s needs. This dilemma has reinforced the concept that the role of 
post-high-school vocational-technical education is not to offer even more 
knowledge, but to select from the vast stock of knowledge that which is 
essential. Such a technique should enable the student to develop the 
aptitude for acquiring and using knowledge on a continuing basis. 

To insure a receptive audience for new developments, the educational 
planner-administrator must cultivate the right attitudes in his faculty. 
If the educator accepts innovations, they will be more easily assimilated 
into the regular process of education itself. It is only in this way that 
teaching can become an instrument not only for the dissemination of 
knowledge, but also for its production, especially in higher education. 

An exploration of the awareness of technological changes on the 
part of industries’ officers and technical institutions’ educators and 
administrators revealed that, in most cases, the question of education 
and technological development had been given careful thought. However, 
until now the technological changes have not been of a kind to induce 
smaller and middle-sized firms to make any special investigation. They 
expressed the opinion that it was not possible to distinguish technological 
changes from other simultaneously influential factors behind movements 
in the manufacturing industry. 

The economists in the firms that were investigated agreed that there 
are no instruments to help predict the kind and extent of educational 
changes that will be necessary in the future. They felt that the lack of 
a systematic frame of reference has contributed to the issues regarding 
technological change and the ensuing curricular changes as they are 
affected by various broad policies and policy decisions. 

Almost all interviewees (90%) in this research complained that in 
vocational training and in technical education, change is accepted too 
slowly. Complete diffusion of successful innovations appears to take a 
decade after the first introduction. 

However, in defense of many outstanding post-high-school institu- 
tions, other representatives mentioned that the rate of acceptance had 
increased considerably in recent years. This acceleration can be observed 
not only in the introduction of primarily technical innovations, but also 
in organizational changes and in curriculum materials. 


Somers (1968) expressed a similar opinion and called for an analysis 
of procedures usually adopted in reaching decisions on the initiation of 
new vocational-technical educational programs. He reported: 
. . . the established procedure for beginning a new course is for
the school’s director or coordinator to utilize the services of an 
advisory committee, either a standing group or one appointed 
ad hoc for this purpose. The committees are to be composed of 
employer, union, and public members. Although the pressure for 
establishment of the course may initially come from the school 
staff, from a group of employers in the community, or from stu- 
dents who wish to enroll in such a course, it is the responsibility, 
first, of the advisory committee and then of the implementing 
school officials to evaluate the real present and future need for 
such a course on the basis of the best labor market available.


Having determined the need, the decision to go ahead will 
presumably depend on costs and available budget, and on such 
practical considerations as the availability of space and equipment. 
Once the school authorities are convinced of the wisdom of the
new course, they must then persuade local and state education
boards [pp. 53-54]. 


It is interesting to note that all interviewed representatives of industry 
associated the major problems in vocational-technical education with the
absence of appropriate mechanisms to initiate changes and with the need 
to develop attitudes which would make innovations more acceptable. It is 
largely as a consequence of recent changes in attitude towards vocational- 
technical education that educators at post-secondary institutions have been 
encouraged to think of educational changes as a continuous, rather than 
a periodic, process—a “rolling” adjustment to technological changes. It also 
seems to have been fully recognized by educators that scientific and 
technological changes not only affect the content of the materials, but 
also the attitudes and habits which should be developed. 


In view of possible revisions of the curricula, it is felt by the author 
that educators at vocational-technical institutions should think primarily 
of providing generalized basic courses rather than specialized subjects with 
currently fashionable names and content. Teaching more courses in 
mathematics and the physical sciences for instance will have to serve 
the needs of technological changes, including automation. 


A large number of educational planners-administrators in vocational- 
technical institutions and junior colleges felt that technical education 
need not, and perhaps in many cases, should not be aimed at meeting 
the technological changes which determine the manpower requirements 
of the various industrial groups. More than 90% of the respondents 
expressed their strong feeling that technical education—including the 
training of highly qualified technicians—should focus on establishing a 
broad intellectual foundation which would enable the student to identify
and solve problems he encounters at work. 

The opinions of the educational planners-administrators were con- 
trary to the opinions expressed by the first-, second-, and third-level 
supervisors and top-level industrial officers who were interviewed. Over 
70% of the respondents indicated that technical education “beyond the 
high school” should meet the specific needs of industry; and 60% of the 
interviewees added that short-term needs ought to be served by vocational- 
technical institutions. Thus, although they would generally like to see 
broad-based curricula in technical education, the employers in manu- 
facturing firms, transportation, communication, and public utilities 
expressed their need for specialists. 

Opinions of the balance between the “theoretical” and the “practical” 
side of technical education, for instance, undoubtedly vary at the numerous 
institutions in different states. In Florida, 80% of the faculty members 
interviewed at technical institutions stated that industry exerts a “certain 
amount” of pressure on these schools to adjust their technical curriculum 
to satisfy the specific training requirements of individual companies. 

Although the pressure exerted by companies in such circumstances is 
understandable, it is not justifiable because the effect of technological 
change is often unpredictable. Fortunately, such a sharp contrast between 
academics and technology is seldom emphasized in engineering education. 
Most of the respondents felt that the main task of colleges of engineering 
is to educate engineers academically and to make specific arrangements 
with industry so the students’ practical training will be “related” to 
their educational progress. While this combination of academic education 
and industrial training is deliberately designed for students who plan to 
make their careers in the manufacturing industry, it must not be designed 
to serve only the more limited goals of a particular industry or company. 


Reliance on Internal Technical Training Dangerous? 


During the past few years the engineering profession has been faced 
with increasingly new and extremely complex problems. These problems 
require educational program-planning. Manufacturing processes either are 
or are becoming extremely complex; technological advances require the 
young technician or engineer to have an education based both on 
engineering science and on the pure sciences. The scientific training 
of past years was founded on the pattern of slow evolution of individual 
development in pace with the existing transition rate from discovery to 
application. This pattern no longer exists. Thus Parnes (1963) wrote: 


. . . the coupling of this factor with the ever-increasing fund of 
knowledge results in an unquestioned need to reorganize training 
methods to incorporate more of the scientific approach to 
engineering. This includes not only an increase in emphasis on 
fundamental principles and mathematical tools, but also instruc- 
tion in the use of these principles and tools in their application 
to engineering problems [p. 50]. 


More than 75% of the educators and administrators who were inter- 
viewed in Kraft’s study of manpower and occupation indicated that 
automation—or any other technological change—will be less likely to come 
as a tidal wave and more likely to come as a succession of groundswells 
that will reach different operations and industries at different times and 
with different impacts. The same staff members mentioned three built-in 
governors that should regulate the spread of automation in the manufac- 
turing sector so that it will not overtax the firms’ abilities to absorb it. 
The three governors are: (a) the technical limitations of the design to 
automatic applications, (b) the limited economic feasibility of automation, 
and (c) managerial inability to fully understand and take advantage of 
the opportunities which automation presents. 


In designing a “proper” program, engineering faculties find themselves 
in a dilemma since their students engage in widely varied types 
of work. 


Since management recognizes the need for highly qualified technical 
personnel who are also trained in general management, much industrial 
management training is carried out internally by the larger manufacturing 
firms. Only 10% of the educators who were contacted expressed doubts 
about the quality of training offered in industrial institutions. The 
majority felt that, at present, certain firms can impart more knowledge 
to their technicians and engineering staff than can academic institutions. 
However, as the rate of technological change increases in manufacturing, 
contract construction, communication, and public utilities, to name a few, 
the need for more cooperation between those industrial groups and 
technical institutions should grow. 


Several engineering colleges in the Southeast have designed a core 
of courses in engineering science common to all engineering curricula. 
It was interesting to find that 50% of the interviewees saw great merit 
in emphasizing general principles, whereas the other half of the sample 
opposed the core curriculum on grounds that specialties should not be 
incorporated into a common course and taught to engineering students 
as a whole. Those who opposed the core curriculum thought that a more 
fundamental, or undergraduate, instruction would be desirable, but a 
“single basic curriculum” would be unrealistic because of the diversity of 
sciences on which engineering practice is based. Several colleges of 
engineering were criticized by industry for offering courses or clusters 
of courses with little or “no reference to the application of special 
knowledge in industry.” 

All large firms in the sample provide special technical training for 
their qualified employees. Only 10% of all company officials saw any 
danger in the reliance on internal technical training, but 65% of the 
academic staff members pointed out that there are two basic danger-zones. 
First of all, on-the-job training too often tends to be of a very narrow 
kind; secondly, since not enough new ideas are getting into the company, 
a large amount of information and knowledge may be given with little 
or no reference to technological changes. 


Predictive Instruments Needed 


There seems to be a general consensus among educators and educa- 
tional administrators that the need for a broad and well-structured technical 
education curriculum does not arise solely from humane idealism, but 
also from urgent practical economic needs. I think that the adjustment 
of the educational structure to technological change is an essential basis 
for any attempt to prepare this country’s youths intelligently for the 
educational tasks that lie ahead. 

As pointed out earlier, the effects of technological changes are by no 
means rigidly determined by technological factors. These set certain limits 
to the kinds of development that can occur, but within these limits there 
is enough room for considerable variation. Technological changes, thus, 
offer freedom of choice in such matters as curriculum changes and job 
design. From another viewpoint, this freedom can be seen as less advantageous, since 
human inertia and the complicated procedures of changing an existing 
curriculum might prevent us from reaping the full benefit of these changes. 

This attempt represents a potentially serious philosophical conflict 
between the new manpower interest in education and the traditional view 
of education’s role in a democratic society. Under the traditional view, 
the purpose of education was to enable the individual to realize his full 
human potentialities for his own sake; in the social demand approach, 
however, industry as well as cultural and public institutions have to be 
staffed with persons having the requisite education and skills (Coombs, 
1960). Specialists engaged in educational planning must consider this 
conflict carefully. One of their major tasks is to convince statesmen, 
educators, and educational administrators that this conflict is not irrecon- 
cilable and that the two educational objectives can be balanced. 

Available data indicate that technological change and, in particular, 
the development of automation, did not involve any serious consideration 
of a closer cooperation between industry and vocational-technical training 
institutions and schools of technical higher education. More than 20% of 
all answers received from academic staff members indicated that the main 
function of technical education should be the development of fundamental 
knowledge, a role not easily reconciled with specific industrial require- 
ments. Some industrial officers (19%) were indifferent toward vocational- 
technical education; they had not considered seriously how vocational- 
technical education centers and colleges of engineering might assist them 
in acquainting future staff members with technological change. The present 
lack of interest by men in industry seems to indicate that only to a very 
limited extent do they recognize the possibility of influencing the curriculum 
structure. 

Predictive instruments may be capable of providing the educational 
planner-administrator with information having long-term implications. 
With these instruments the planning specialist not only would be in a 
position to identify the skills most likely to be replaced in future years, 
but also would be assisted in projecting long-term educational needs. 
Such forecasts would provide the needed support for the development of 
a long-range vocational-technical education policy. 

The economist who wants to assist educational administrators in 
decision-making needs predictive models suitable for testing. The develop- 
ment of such instruments should make it possible to predict the effects of 
technological changes on occupations. My position is that a mathematical 
model of technological change, i.e., a systems model, is necessary to make 
predictions. Such a model is not easy to construct because of the scarcity 
of explicit quantitative data on variables involved in technological change. 
In fact, many economists expressed the opinion that the derivation of a 
complete, closed, and predictive systems model is impossible. Nevertheless, 
more refined forecasting techniques, particularly long-term ones that are 
used in identifying the impact of technological changes on skill requirements 
and demand for labor, are needed; and, at the same time, a regular 
evaluation of the relevance of technical curricula to the educational input 
into the labor market is required. Such research would yield important 
information on which educational administrators could base further action 
relating to the formulation of occupational and educational relationships 
in order to better adjust the curriculum to changing industrial manpower 
needs. 


ECONOMIC BENEFITS OF COLLEGE EDUCATION 


David R. Witmer 
Wisconsin State Universities 


In 1928 Walter S. Gifford read three reports on the economic 
benefits of college education; he then wrote an article for Harper’s Magazine 
(Gifford, 1928a) in which he said that the more education you have, the more 
money you make. His article created a sensation (Hutchins, 1936). Gifford, 
president of AT&T, was an important man on the national scene from 
the late 1920s through the early 1940s. In announcing that business wants 
the scholar, he was speaking for many men of affairs and verbalizing a 
major shift in the American folklore of values and a momentous change 
in higher education. 


Early Attitudes 


A century earlier the Yale faculty reported (Day et al., 1963), 


The public are undoubtedly right, in demanding that there should 
be appropriate courses of education, accessible to all classes of 
youth . . . (But) there are many things important to be known 
which are not taught in colleges, because they may be learned 
anywhere . . . (However) if suitable arrangements were made, 
the details of mercantile, mechanical, and agricultural education 
might be taught at the college . . . Practical skill would then be 
grounded upon scientific information . . . whatever a young man 
undertakes to learn, however little it may be, he ought to learn it 
so effectually that it may be of some practical use to him [p. 85]. 


Since the time of Benjamin Franklin (1761), Americans have valued 
a practical education. In 1828, applied arts at the college level were 
generally still matters contemplated for the future; by 1928 they had 
become a widespread reality. Whether the object of education is “to lay a 
disciplined and furnished foundation” as envisioned by the Yale faculty, 
the “growth” described by John Dewey, or something else, education is 
something of value. That value is subject to measurement. 

The American penchant for measuring education in economic terms 
can be traced to the early days of the Republic. “If the children of the poor 
were forced to attend school, they would not be free to work. In fact, 
Niles’ Weekly Register, published for businessmen, reported how much 
wealth could be made for American manufacturers if all the children could 
be employed in the mills [Rippa, 1957, p. 106].” For 40 years prior to 
1928, the National Association of Manufacturers and other business groups, 
under the influence of writers like Leonard P. Ayres (1909), had criticized 
the schools and depreciated the value of a college education. Gifford’s 
article signaled a turning point. 


American educators, since Horace Mann (1949, 1952) in the 1840s, 
have vigorously defended the monetary value of education to society and 
the individual. The increased earning power of the educated individual, 
as reflected in the market, has been one of the most frequently used 
arguments. Economists have also been strong supporters of education. 
Adam Smith (1937), the founder of classical, liberal (laissez faire) eco- 
nomics, condoned government expenditures for education. His contemporary 
ideological heir is Milton Friedman. Under the constraints of a greatly 
changed political economy, Friedman (1954, 1962) also condones the 
employment of tax resources for education, though he advocates a different 
system of support. Marshall (1961) observed that it would “be profitable 
as a mere investment, to give the masses of people much greater oppor- 
tunities . . . (for education) [p. 216].” More recently, Galbraith (1958) 
called for a massive shift in spending from the private to the public sector 
for purposes specifically including education. The list of economists could 
be expanded. Interest in the economics of education has been avid. Ellis 
(1917) listed more than 125 books and articles published since 1885 on 
the relationship between schooling and income. Clark (1940) claimed 
that by 1940 the total was more than 500. Since 1940 the annual rate 
of publication on this and closely related subjects has increased 
dramatically. 


Pioneering Studies 


William Petty (1899), the father of modern economics, estimated by 
capitalizing the wage bill that the value of an Englishman in 1687 was 
£69-90. Farr (1853) calculated the value of a human being by capitalizing 
future lifetime earnings. Although there are occasional slips into the use 
of cost of production approaches, Farr’s procedure is widely followed by 
modern economists in measuring the value of education. 


As a follow-up to his reading, Gifford (1928b) directed the AT&T 
Personnel Department and Bridgman to investigate the matter more 
meticulously. Gifford’s early impressions were corroborated. Not only did 
education correlate positively with income, but students with high grades 
earned incomes far above the median for college graduates as a whole, 
while those with low grades fell moderately below. Using retrospective 
as well as contemporary data from questionnaires returned in 1928 by 
30,949 male matriculants of the 12 classes of 1889-92, 1899-02, 1909-12, 
and 1919-22 at 50 land grant colleges and universities, Bridgman (1931) 
and his associates worked out the following significant conclusions: (a) 
Between 1904 and 1928 beginning annual median earnings of college 
graduates increased 100% while the cost of living went up 105% and 
industrial wages climbed 175%. (b) Annual median earnings described 
a smooth, upward curve which increased 260% between the first and 
tenth, 367% between the first and twentieth, and 413% between the first 
and thirtieth year following college graduation (B.A.). (c) Though the 
range of annual median earnings during the first year following graduation 
was narrow ($1300-1600 in 1928), men in professional (excluding teaching) 
occupations earned median incomes of about $5,900 annually in the 
fifteenth year after graduation while those in business, teaching, and 
farming earned, respectively, approximately $1,000, $2,000, and $3,000 
less. By the thirtieth year following graduation the range had widened 
even more. (d) During the fifteenth year after graduation, median earnings 
of college graduates in the industrial northeast were about $900 or 20% 
higher than those in the south and west. The $900 gap persisted through 
the thirtieth year following graduation. (e) Though they started at the 
same level, graduates in arts and sciences earned $600 more ($5,600 vs. 
$5,000) 15 years after graduation than graduates prepared in engineering; 
engineering students earned $1,000 more ($6,900 vs. $5,900) than arts 
and science majors 30 years after graduation. 

Bridgman recognized the contributions of education to citizenship and 
the enrichment of life, as well as the economic benefits of college and 
university education. His study was carefully organized and skillfully 
executed. He was aware that native ability, motivation, personality, and 
family connections, as well as general education and major program of 
study, largely determined occupational choice, earnings, and success. 
Postulating that age on entry to college was a measure of native ability, 
he uncovered a pattern which indicated that the younger the individual 
(i.e., the higher his ability) as a member of his graduating class, the 
higher were his earnings. While this evidence of the effect of ability is not 
compelling, it was the best he could do with his data. 

Gorseline (1932) published the results of a questionnaire-based study 
of 370 white, American-born, blood brothers who had attended school 
in Indiana. Dividing brothers into two groups according to amount of 
schooling (i.e., the brother who had attained the higher level of schooling 
was placed in one group, the brother who had attained the lower level of 
schooling was placed in the other), Gorseline found no significant difference 
between the two groups in terms of age (36.90 years vs. 36.21 years), 
place of residence, inheritance, marital status, number of children, “luck,” 
or “money losses due to bad health.” Claiming that the group with more 
schooling had “slightly higher marks and slightly higher test scores,” he 
asserted that a mean difference in schooling of 3.45 years or 35% 
(13.31 - 9.86 = 3.45; 3.45 / 9.86 = 35%) resulted in a difference of 34% 
or $515 ($2,015 - $1,500 = $515; $515 / $1,500 = 34%) in income, at the 
median, in 1927. 
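
The percentages attributed to Gorseline can be rechecked from the figures quoted in the paragraph above. The short Python sketch below is only an arithmetic check; the function name `percent_difference` is an illustrative choice and does not come from the original study.

```python
def percent_difference(higher, lower):
    """Return the gap between two values as a percentage of the lower one."""
    return (higher - lower) / lower * 100


# Figures as quoted in the text (Gorseline, 1932).
schooling_gap = percent_difference(13.31, 9.86)  # mean years of schooling
income_gap = percent_difference(2015, 1500)      # median income in 1927, dollars

print(f"Schooling difference: {schooling_gap:.0f}%")
print(f"Income difference: {income_gap:.0f}%")
```

Both gaps are expressed relative to the group with less schooling, which is how the parenthetical arithmetic in the text is set up.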

The economic importance of higher education was investigated by the 
Harvard economist Walsh (1935). Drawing data on the pecuniary benefits 
of education from five other studies covering the years 1926-32 and on 
the costs of higher education from the Survey of Land Grant Colleges and 
Universities and School and Society, he found, even after discounting 
benefits at the rate of 4% per year and implicitly including the cost of 
earnings foregone during schooling, that the benefits of four years of 
college education over those of high school exceeded the costs by $2,632 
($9,030 — $6,398) for women and $28,611 ($35,009 — $6,398) for men. 
For some five year or longer college programs, however, he found that the 
extra costs exceeded the extra private monetary benefits. 
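
Walsh's comparison rests on discounting a stream of extra earnings back to the time of the schooling decision and then subtracting cost. The Python sketch below illustrates that procedure under stated assumptions: the function name `discounted_sum` and the flat $1,000 earnings stream are hypothetical and are not Walsh's data. Only the 4% rate and the totals in the final check are taken from the text.

```python
def discounted_sum(amounts, rate):
    """Discount a list of yearly amounts (year 1, 2, ...) back to year 0."""
    return sum(a / (1 + rate) ** t for t, a in enumerate(amounts, start=1))


# Hypothetical stream: $1,000 of extra earnings per year for 40 years.
extra_earnings = [1000.0] * 40
value_at_4_percent = discounted_sum(extra_earnings, 0.04)
print(f"Discounted value at 4%: ${value_at_4_percent:,.0f}")

# Totals quoted in the text: discounted benefits minus cost, in dollars.
cost = 6398
net_women = 9030 - cost
net_men = 35009 - cost
print(f"Net benefit, women: ${net_women:,}")
print(f"Net benefit, men: ${net_men:,}")
```

Raising the discount rate in `discounted_sum` lowers the value of the same earnings stream, which is why the choice of rate matters so much to the comparison.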

Walsh believed, with respect to investment in college education, that 
individuals behave in a manner explainable in rational terms whether 
or not they consciously included economic considerations in their decision 
making. The differences between cost and benefit remained wide despite 
flaws in Walsh’s study: his costs were too high because he included the 
full cost of subsistence instead of only the extra cost of subsistence in 
college over the cost of room, board, etc., outside of college, and his 
discount rate may have been too low. The differences, he explained, were 
due to imperfections in the market, the effects of marriage, the unmeasured 
services of wives and mothers, unaccounted for social benefits such as 
improved citizenship, variance in the desirability of different occupations 
(travel, vacation, pleasantness, stimulation, “the good life”), prestige, 
altruism, humanitarianism, and “prizes” (i.e., the possibility of earning 
exceedingly high incomes in some professions), the “artificial protection of 
some occupations” (i.e., restrictionism), custom, sentiment, “the natural 
affectionate propensity of parents to pay for the college education of their 
children,” etc. Walsh did not attempt to quantify the effects of any of 
these factors. Since Walsh’s work, few concepts have been added to the 
subject field, and important points have too often been ignored. 

Friedman and Kuznets (1946), studying the incomes of doctors and 
dentists in independent practice, found that doctors earn much more than 
dentists. The authors partially explained the difference by noting that 
entry to medical training was institutionally restricted, e.g., by specific 
action of medical schools and practitioners. Though Friedman and Kuznets’s 
explanation raised a storm of protest, the storm passed. Harris (1949) 
wrote of restrictionism in the field of medical education as though it were 
a universally recognized fact. 
Harris later committed the incredible error of predicting that the 
monetary value of college education was going to decline drastically relative 
to the value of high school education. The income of college graduates is, 
of course, affected by the supply of college graduates, but demand for 
college graduates, which also has its effects, has been rising. Costs can 
be expected to drift slowly downward (relative to costs of elementary and 
secondary schooling, and in constant dollars) due to economies of scale. 
Nonetheless, one can expect the (constant dollar) monetary value of college 
education to fall gradually relative to costs as the percentage of the 
population graduating from college increases. The percentage of talented 
high school graduates going on for higher education has been growing 
so rapidly that Americans seem to be in a transition to 
college-for-the-masses comparable to the transition to 
high-school-for-the-masses in the 
first quarter of this century. On completion of the transition, say by 
the year 2000, the college degree will be required at the threshold to the 
same good employment as is the high school diploma now. The income gap 
between high school and college graduates can then be expected to be 
larger than it is today. Needless to say, Harris has abandoned his previous 
position. 

Using data from the 1950 census, Anderson (1955) and Zeman (1955) 
showed that the effect of education on income is substantially different 
for whites and non-whites in the United States. Miller (1962) corroborated 
their findings and went on to claim that veterans of World War II who 
did not receive scholarship aid under the GI Bill had only slightly lower 
average incomes than those who did, despite their substantially lower 
average educational attainments. 


Contemporary Concepts 


Although the authors failed to discount income benefits in recognition of 
time preference, they did subtract the cost of college education ($8,200) 
and the lifetime interest that the investment could earn in US Government 
bonds ($24,000). They found dispersion around the averages to be so 
great that even at the period of peak earnings, age 45-54, when college 
graduates averaged $7,907 per year and high school graduates only $4,519 
per year, one-fourth of the college graduates made less than the average 
high school graduate of similar age who did not go on to college. (This 
may be due to difference in the quality of education; also, perhaps too 
many individuals with relatively low intelligence get more schooling than 
they should.) Glick and Miller noted further that graduation yields a 
bonus of about twice the increment realized by the average individual 
who starts college but does not finish. This, they suggested, indicates that 
persistence in school reflects a complex of capabilities and motivational 
factors which are conducive to success in the world of work as well as in 
college. 

Present value. Using the same cross-sectional data that Miller and 
Glick studied, Houthakker (1959) applied slightly different techniques 
(including an adjustment for mortality) and calculated the present value 
of lifetime income, before and after federal income taxes, at four different 
discount rates. He suggested that from the private point of view the 
discount rate should be the cost of borrowing to pay for schooling, and 
from a social point of view, the rate should be the rate of return on 
alternate investments. Houthakker also pointed out the appropriateness, 
in a growing economy, of superimposing an upward trend on cross-sectional 
patterns of earnings. 
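
The present value approach described in this paragraph can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the flat $5,000 income profile is hypothetical, and the four discount rates are not necessarily the ones Houthakker used. The optional `survival` and `growth` arguments stand in for the mortality adjustment and the upward trend mentioned in the text.

```python
def present_value(incomes, rate, survival=None, growth=0.0):
    """Present value of a yearly income stream.

    incomes  -- income expected in year 0, 1, 2, ... (cross-sectional profile)
    rate     -- annual discount rate
    survival -- optional probabilities of being alive in each year
    growth   -- optional annual upward trend applied to the profile
    """
    if survival is None:
        survival = [1.0] * len(incomes)
    total = 0.0
    for t, (income, alive) in enumerate(zip(incomes, survival)):
        total += income * alive * (1 + growth) ** t / (1 + rate) ** t
    return total


# Hypothetical profile: $5,000 a year for 40 working years.
profile = [5000.0] * 40

# Illustrative discount rates only.
for rate in (0.0, 0.03, 0.06, 0.08):
    value = present_value(profile, rate)
    print(f"rate {rate:.0%}: present value ${value:,.0f}")
```

Setting `growth` equal to `rate` cancels the discounting, which shows how strongly an assumed upward trend in earnings can offset the choice of discount rate.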


Internal rate of return. Schultz (1959, 1960, 1961a, 1961b, 1961c, 
1962), the father of the modern human capital concept, identified 
education as a component of human capital and suggested that schooling pays, 
on the average, a better return than most other investments. He then 
measured (in constant dollars) the stock of education acquired via schooling, 
as of 1957 as the sum of the costs from 1900 to 1957, and found that 
college and university schooling accounted for 22% of a total educational 
stock in labor force members 14 years and older ($118 billion of $535 
billion). Cost is a questionable indicator of value, and as Schultz readily 
admitted, he did not make distinctions on the basis of age, nor did he 
allow for the obsolescence of education. Nonetheless, his work had a 
wide impact on educators and on economists. Specifically addressing 
himself to educators, Schultz calculated the rate of return on total 
investment in college education as about 11% per year. Finally, he 
calculated the effect of education on the national growth rate as accounting 
for 20% of the observed increase during the period 1919-57. 


The year 1962 was an active one for students of the economics of 
education. Research scholars made a number of contributions to under- 
standing the benefits of college education. Denison (1962, 1967) addressed 
the problem of explaining the sources of economic growth in the United 
States, an issue growing out of the prior work of Fabricant (1959) and 
others. Boldly measuring factors which had been considered nonquanti- 
fiable (a practice which causes some physical scientists to view economists 
with incredulity and other social scientists to look on in envy), he ascribed 
23% of the growth rate of total real national income in the period 1929-57 
to education, and another 20% to the “advance of knowledge” and “change 
in lag in application of knowledge.” He subsequently made similar 
computations for the period 1957-62 and found that 15% of the growth 
rate could be accounted for by changes in education. This essentially 
confirmed Schultz's findings and has great implications for those engaged 
in the explicit valuation of college education. 


With two grants from the Carnegie Corporation and a mandate from 
the National Bureau of Economic Research to perform a definitive analysis 
of the money rates of return on investment in education, Becker (1960, 
1964) labored from 1957 to 1964. His results included a unified, compre- 
hensive theory of human capital investment and the conclusion that 
investment in college education results in an annual private return of about 
9% to 11% and a social (total, including private) return of from 8% to 
20%. Lest this be too quickly dismissed as the mouse-sized issue of 
mountainous labor, it should be noted that Becker (a) made downward 
adjustments of from about 12% to 23% to account for the effects of native 
ability, motivation, and father’s occupation; (b) covered all college students 
—women, dropouts, nonwhites, and rural—not just white male, urban 
students; (c) adjusted lifetime income profiles to reflect secular growth 
patterns, and (d) ignored nothing of significance that had been done pre- 
viously. His publication Human Capital is quite properly considered a 
textbook. 


Using the 1950 census data for the year 1949 that Miller and Houthak- 
ker had used, but refining the analysis, Hansen (1963) demonstrated the 
superiority for realistic decision making of the internal rate of return over 
the additional lifetime income and present value approaches to measuring 
returns on investment in schooling. The annual return to total invest- 
ment, in college over high school, was 10.2% while the return to private 
investment was 11.6% before and 10.1% after federal income taxes. The 
private return after taxes (10.1%) was lower than the return on total costs 
(10.2%). Since society reaps other social benefits, as well as substantial 
state taxes, it is obvious that “the student pays more than his own way at 
the college level.” 
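
The internal rate of return is the discount rate at which the present value of schooling costs and later earnings gains sums to zero. The Python sketch below finds that rate by bisection. It is a minimal illustration, not Hansen's procedure: the function names and the cash flows (four years of $3,000 costs followed by 40 years of $1,500 gains) are hypothetical.

```python
def net_present_value(cash_flows, rate):
    """Net present value of yearly cash flows, the first falling in year 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def internal_rate_of_return(cash_flows, low=0.0, high=1.0, tolerance=1e-9):
    """Find by bisection the rate at which net present value is zero.

    Assumes costs come first and benefits later, so that net present value
    falls as the rate rises and changes sign once between low and high.
    """
    if (net_present_value(cash_flows, low) < 0
            or net_present_value(cash_flows, high) > 0):
        raise ValueError("no sign change between low and high")
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_present_value(cash_flows, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Hypothetical student: four years of costs, then 40 years of extra earnings.
flows = [-3000.0] * 4 + [1500.0] * 40
print(f"Internal rate of return: {internal_rate_of_return(flows):.1%}")
```

Unlike a present value figure, the result does not depend on choosing a discount rate in advance, so it can be compared directly with the return on other investments.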

Market analysis. Hansen (1961, 1963, 1967), continuing his research 
by comparing the rates of return from investment in alternative types of 
professional education, found that the shortage of physicians and dentists 
has been decreasing since 1949. This is undoubtedly related to the sub- 
stitution of drugs and paramedical personnel for physicians and dentists. 
Hansen’s work is particularly noteworthy as a demonstration of (a) the 
use of rigorous economic analysis in place of manpower planning (e.g., the 
use of professional population ratios); (b) the indeterminateness of man- 
power projections; and (c) the variations in definitions of “shortage.” 


REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 4 


Indirect, External, and Social Benefits 


Blaug (1965, 1966), in his thorough comparisons of the rate of return 
and manpower planning approaches to policy formation for higher educa- 
tion, concluded that the two methods are complementary but “by its 
failure to pay any attention to money costs and earnings, manpower plan- 
ning stands condemned as a brand of technological determinism.” Blaug also 
provided a comprehensive compilation of the indirect benefits of education: 
(a) the current spillover income gains to persons other than those who have 
received extra education; (b) the spillover income gains to subsequent 
generations from a better educated present generation; (c) the supply of a 
convenient mechanism for discovering and cultivating potential talents; 
(d) the means for assuring occupational flexibility of the labor force and, 
thus, to furnish the skilled manpower requirements of a growing economy; 
(e) the provision of an environment that stimulates research in science and 
technology; (f) the tendency to encourage lawful behavior and to promote 
voluntary responsibility for welfare activities, both of which reduce the 
demand on social services; (g) the tendency to foster political stability by 
developing an informed electorate and competent political leadership; (h) 
the supply of a certain measure of “social control” by the transmission of 
a common cultural heritage; and (i) the enhancement of the enjoyment of 
leisure by widening the intellectual horizons of both the educated and the 
uneducated. 


Though it contains little that is new, Economics of Higher Education 
(Mushkin, 1962), published by the US Office of Education in 1962, is still 
the most widely read book on the subject. Economics of Higher Education 
introduced the prolific, comprehensive, and complex writings of Bowman 
(1962a, 1962b, 1962c) to a wide American audience. Effectively bridging 
the gap between educators and economists, Bowman proposed a better 
method for estimating the active current stock of education capital and a 
more adequate formula for measuring total social returns to education. 
Calling for more attention to education’s role in facilitating external 
economies, she defined private, nonmonetary returns to include (a) the 
additional things an educated person can produce for himself (e.g., repairs 


tion, etc.); (d) nonmarket production that his education forestalls; and (e) 
enjoyments for which his education disqualifies him—(d) and (e), of course, 
are negative effects. In her view, nonprivate monetary returns are those 
general increases in national income attributable to education; nonprivate, 
nonmonetary returns are such things as (a) voluntary services to the 
community by the educated, who draw on skills acquired in school or on 
leisure which results from positions traceable to schooling; (b) reduced




delinquency rates; (c) psychic income attributable to having educated 
neighbors; and (d) a negative psychic dissatisfaction flowing from aware- 
ness of socio-economic failure or negative status. Though she eschewed 


search. In this publication, Schultz (1962) and Weisbrod (1962) both 
mentioned redistribution of personal income as an equity serving benefit 
of education, and Weisbrod discussed external benefits in sufficient detail 
to illustrate that they are not all in a broad, amorphous form equally 
affecting everyone. 

Weisbrod (1964, 1966a, 1966b) devoted considerable effort to the 
identification and quantification of the external benefits of education. Al- 
though there is disagreement over whether they can be systematically 
identified and measured, the following external benefits flow from
education and ought to be considered by any prudent decision maker:
higher levels of political participation, higher levels of tax revenue, higher 
levels of mobility, intergenerational gains through the informal education 
the children of college graduates receive at home, lower levels of unem- 
ployment, and leadership in maintaining political democracy and a free 
market system. 


Aptitude, Motivation, and Schooling 


Wolfle and Smith (1956) found that rank in class (presumably re- 
flecting aptitude and motivation) was a better predictor of higher income 
than intelligence test scores (aptitude alone), and that father’s occupation 
had little predictive value. They also reaffirmed the common opinion that 
“many high school graduates who are intellectually qualified do not under- 
take college work [p. 232]” and “it is of some financial advantage for a 
mediocre student to attend college, but it is of greater financial advantage 
for a highly superior student to do so [Wolfle, 1960, p. 179].” 

On a related point, Denison (1964, 1967) assumed that 60% of the 
income differentials that appear when men of similar age are classified by 
years of education results from the investment, productivity increasing 
effect of education, and 40% from native ability, intelligence, diligence, 
motivation, energy, race, inherited wealth, family status, etc. Later, on 
the basis of his analysis of the Wolfle-Smith data and in response to 
doubts expressed by Malinvaud, Sandee, and others, he raised his estimate 
of the effect of education as opposed to native ability, motivation, etc., to 




REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 4 


a range of from 67% to 81% of the differential earnings of college graduates 
and high school graduates.


Hunt (1963) applied multiple regression analysis to data from the 
1947 Time magazine study. Reflecting on the results of nine iterations, he 
concluded that about 50% of income differentials was attributable to 
education and 50% to ability, quasi-rents (caused by unforeseen shifts in
supply and demand), differences in nonmonetary income, restrictionism,
work experience, returns to capital, and motivation. Hunt's research
revealed that the degree of self-support while attending college, the prestige
level of the college, participation in extracurricular activities, and the 
educational attainment of the student’s family all had negligible influence 
on postcollege income. 


Using data on about 7,000 male college graduate employees of AT&T, 
Weisbrod and Karpoff (1967) re-examined questions concerning the Mone- 
tary Returns to College Education, Student Ability, and College Quality. 
With class rank as a proxy for ability and motivation, they compared a 
number of subgroups under a variety of assumptions and concluded that 
from 69% to 82% of the earnings differentials between high school and 
college graduates are attributable to advanced schooling. Although they 
could not determine whether variations were due to differences in the 
quality of the college or to the characteristics of the students, they found 
that earnings increased consistently with increases in class rank and college 
quality (the latter subjectively estimated). Finally, explaining that invest- 
ment in higher education provides better returns to individuals who are 
capable of achieving high class rank, which in turn depends on the quality 
of college attended, they provided guides to help students improve on the 
traditional rule-of-thumb, i.e., enroll in the best (most difficult) college 
which will accept you and within which you can rank in the upper third 
of the graduating class. Whether students can estimate their future per- 
formance levels at colleges of various qualities with any reasonable degree 
of accuracy is questionable. Nonetheless, this paper is worth noting for 
its contribution to the “schooling vs. other factors” issue. 


In this survey of the literature on the benefits of college education, 
those studies which illustrate the development and application of theory 
which will be useful in further study have been cited, summarized and com- 
mented on, while indulging a bias toward pioneering efforts and away 
from simple reiterations. One question (namely, what share of income
differentials should be assigned to college education) has been so
intractable, however, that it has seemed advisable to include reference to every
major study that has touched on it.


The standard bibliography in this field is Economics of Education by 
Blaug (1966). For continuing reports on major advances in this and related
topics see The Journal of Human Resources: Education, Manpower
and Welfare Policies, published by the University of Wisconsin Press,
Journals Department, Box 1379, Madison, Wisconsin.
RECENT APPLICATIONS OF COMPUTER
TECHNOLOGY TO SCHOOL TESTING
PROGRAMS


Elinor M. Woods* 
Boston College 


This review contains a summary of the principal appli- 
cations of computer technology to school testing programs reported during 
the five-year period ending December 1968. Readers are referred to Cooley 
and Hummel (1969) for a comprehensive review of current efforts to 
apply computer systems techniques to guidance. 

Three journals carry regular sections on computer
applications in education and psychology. An excellent source for
programs developed and applied to test analyses is the section included in 
the spring and autumn issues of Educational and Psychological Measure- 
ment. The computer section in Behavioral Science published bimonthly, 
although it mainly focuses on programs developed for research purposes, 
also reports on programs developed which are appropriate to psychometric 
procedures. The Nation’s Schools includes a computer section which focuses 
on school data-processing applications. Other periodicals devoted exclu- 
sively to computer applications in education are as follows: AEDS Journal, 
AEDS Monitor, Automated Education Handbook and Newsletter, Data 
Processing for Education, Journal of Data Education, and Journal of Edu- 
cational Data Processing. 


General Applications to Testing Programs 


The references in the literature to computer applications to school 
testing programs tend to be brief listings of general uses rather than de- 
tailed descriptions of specific applications. The general uses made of data 
processing by school testing programs in handling test data commonly 
reported are: computations of raw scores; scaled score conversion from raw
scores; computation of local distributions and norms; computation of
national norms or comparisons thereto; class rosters of test results; group
averages; item analyses of tests, including difficulty level, percent of correct
answers as a function of total score, and inter-item correlation and cluster 
analysis; reliability and validity studies, including locally constructed tests; 
prediction and expectancy studies; gummed labels for posting to permanent 
records; profile reports to counselors, teachers, and parents; and develop- 
ment of banks of items with known statistical properties (Bushnell & Allen, 
1967, pp. 213-214). An example of the application of these computer uses 
to school testing programs is provided by the three-year Richmond (Cali- 
fornia) Pilot Project in Educational Data Processing (Bushnell & Howe, 
1964; Calvert, 1966; Goodlad, O’Toole & Tyler, 1966; Grossman & Howe, 
1964; LaFleche, 1965; Wogaman, 1966a, 1966b). Tondow (1968) sum- 
marized the organization and development of computer services for the 
Palo Alto (California) Unified School District, in which the services listed
for the area of testing closely parallel the foregoing list. The New England 
Educational Data Systems (NEEDS) provides another example (Bushnell 
& Allen, 1967; Goodlad, O’Toole & Tyler, 1966). The Educational Records 
Bureau in New York City performs the scoring and analyzing of aptitude 
and achievement tests for approximately 1,000 member schools—indepen- 
dent colleges and private and public schools (Computer Sorting, 1967). 
Rutherford and Koplyay (1968) summarized the institution of a testing 
program in one of Chicago’s North Shore school systems which provides, 
in addition to local scoring and analysis, an opportunity to make both 
within-school and within-system comparisons. 


Additional References: Anderson (1962, 1966a, 1966b); Computer Services 
(1963); Ferguson (1965); Findley (1963); Gerard (1967); Goodlad, Caffrey,
O'Toole & Tyler (1965); Goodman (1965); Grossman & Howe (1965);
Haga (1967); Kaiser (1966); Raker (1967); Sinks (1964); What Can 
Computers (1966); Whitlock (1964); Wogaman (1966). 


General-Purpose Test Analysis Programs 
Aptitude and Achievement Tests 


The following are illustrative of the general-purpose test analysis 
programs reported. Moore and Schutz (1967) described a computer pro- 
gram which includes several output options: test statistics (mean, standard 
deviation, Kuder-Richardson 20, standard error of measurement, frequency
distribution); converted scores (percentiles, stanines); and item statistics 
(percent and frequency of responses to each choice for each item, item 
difficulty index, item discrimination index). Evans and Westbrook (1968) 
developed a program for computing item and test statistics which adds 
several features to the latter program. Lewy and Crawford (1966) outlined 
a program designed to score a complex test battery and to provide individual 
scores, test statistics and item data for each subtest separately and for the




battery as a whole. Raw scores, percent scores, weighted scores, and T- 
scores may be obtained for each of the subtests and total test, and an 
intercorrelation matrix is included as part of the output. They noted that 
a test consisting of 361 items classified into 15 subtests taken by 196 
subjects was scored in 2.96 minutes. Aleamoni et al. (1967) outlined a 
program designed to score multiple-response tests, with item weighting as 
an option; to compute test and item statistics and inter-item correlations; 
and to perform a principal components factor analysis with a varimax
rotation.
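
The statistics listed for these programs can be illustrated with a minimal sketch. It is not taken from any of the programs cited; it assumes dichotomously scored (0/1) items, at least two items, and some spread in total scores, and the function name is hypothetical.

```python
from math import sqrt


def item_and_test_statistics(responses):
    """Summarize a test scored 0/1.

    responses: one list per examinee, one 0/1 entry per item.
    """
    n = len(responses)       # number of examinees
    k = len(responses[0])    # number of items
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    # Population variance of total scores (divide by n, not n - 1).
    variance = sum((t - mean) ** 2 for t in totals) / n
    # Item difficulty index: proportion of examinees answering correctly.
    difficulty = [sum(row[j] for row in responses) / n for j in range(k)]
    # Kuder-Richardson formula 20 reliability estimate.
    pq = sum(p * (1 - p) for p in difficulty)
    kr20 = (k / (k - 1)) * (1 - pq / variance)
    sd = sqrt(variance)
    # Standard error of measurement from the SD and the reliability.
    sem = sd * sqrt(1 - kr20)
    return {"mean": mean, "sd": sd, "kr20": kr20, "sem": sem,
            "difficulty": difficulty}


# Four examinees, three items (illustrative data only).
sample = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
stats = item_and_test_statistics(sample)
```

For the sample data the total-score variance is 1.25 and the sum of the item variances is 0.625, so KR-20 works out to 1.5 × (1 − 0.5) = 0.75.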


Psychological and Interest Inventories 


A computer system was developed by Iker and Harway (1965) to 
meet the need for reasonably reliable, accurate and efficient content analy- 
sis in the area of psychotherapy. This system can be used for psychological 
tests, such as the TAT and the Blacky, which produce a large amount of 
verbal material, to determine common factors within the test for an in- 
dividual. Gorham (1967) outlined a computer-based scoring system for 
inkblot responses. For the analysis of multiple-response inventories, Adel- 
man and McWhinney (1966) reported a program which computes a series 
of individual factor scores and norms for the entire sample. 

Goldstein, Linden and Baker (1967) described a program written to 
score and norm tests such as the MMPI or SVIB, or any test in which a 
dichotomous-response format is employed. Smith, Caras and Cohler (1966) 
designed a program which consists of a chain with two links, each responsible
for one of several operations involved in scoring and making certain
actuarial decisions concerning the MMPI. The first link scores the raw
data according to the usual clinical scales and a number of special research
scales and produces raw and T-scores, both uncorrected and K-corrected.
The second link applies the Meehl-Dahlstrom rules for determination
of psychopathology and the Marks and Seeman rules for determination
of the profile configuration. Statements about the person’s scores on the 
several clinical scales are made. The program is reported to have processed 
400 subjects in 12 minutes. Another computer system which scores and 
analyzes test protocols on the MMPI was described by Fowler and Marlowe 
(1968). The system’s output includes raw and T-scores for validity, clinical
and special scales; printout of critical items answered in the deviant direc- 
tion; a graphic profile of scores; and a narrative interpretive report. A 
validation study of the interpretive computer reports with individual clinical 
reports of 2,000 cases over a four-year period was reported. 


Additional References: Aleamoni & Spencer (1967); English & Kubiniec
(1967); Jones (1966); Jones, Pullias & Michael (1965, 1967);
Levonian (1964); Nichols & Tetzlaff (1965); Olson & Royce (1967); Swanson
& Dunn-Rankin (1965); Wuebben, Timmermans & Timmermans
(1967).


Classroom Applications 


The significant contributions of commercial testing agencies to the 
scoring and analyzing of standardized tests for school testing programs need 
no documentation. The Measurement Research Center (MRC) processed 
in 1967 over 10 million answer sheets for comprehensive test batteries 
averaging over 32 pages each (Lindquist, 1968). Most major test makers, 
while giving the classroom teacher the option of filling out individual stu- 
dent profile charts, can furnish a machine-produced graphic picture of 
the student’s overall level of achievement in relation to either his age or 
his grade group. Examples of individual profile charts prepared by major
test makers are provided by the California Individual Test Record (Cali- 
fornia Test Bureau); the Metropolitan Individual Profile Chart (Metro- 
politan Achievement Tests, 1960); and the SRA Pupil Profile (Thorpe, 
Lefever & Naslund, 1962). 


Individual Applications 


Maberly (1964) pointed out the myth that school testing consists 
solely, or predominantly, of centralized programs utilizing a variety of 
standardized instruments; he asked: “What of the hundreds of thousands of 
teachers who daily prepare their little quizzes, their mid-term tests, and 
their final examinations? Are these not also a vital part of the evaluation 
process?” Although Goodlad, O’Toole and Tyler (1966, p. 99) lamented 
that many persons were too busy actually applying electronic data process- 
ing to keep records of the successes and failures attending thereto, a number 
of the applications reported in the literature are appropriate to the con- 
struction, scoring, analysis and evaluation of classroom tests. The cost 
of an optical scanning test scoring machine and computer time is negligible 
in comparison to the cost of man-hours that would be required to analyze 
items and tests. With the availability of test and item analyses, weak- 
nesses in item construction and in the individual student’s knowledge are 
being uncovered. Rosinski and Hamilton (1966) noted that the wealth of 
information provided by test and item analyses has introduced a spirit of 
self-criticism, not only in regard to the construction of classroom tests but 
also with respect to the teaching of each subject. 

Weisbrodt, Starry and Rock (1964) reported that the Measurement 
and Research Center at Purdue University uses the IBM 1401 and 7090 
computers as an integral and vital part of its program to assist faculty 
members in the upgrading of classroom examinations. The program was 
designed to stimulate instructor interest in test development techniques by 
routinely providing essential evaluative information in conjunction with an
efficient test scoring facility. The Pennsylvania State University is using 
its computation center to process test results and analyses in an attempt to 
improve instructional effectiveness through the improvement of examina- 
tions and examination systems (Dick & Spencer, 1965). The use of several 
total-test and item-analysis computer programs was reported, and the swift- 
ness with which these computations are performed by the computer was 
indicated by the example cited of three total tests, each with 100 items, 
being processed in 1.5 minutes for a total of 652 testees. The test scoring 
service and statistical laboratory at Colorado State University developed 
a series of programs designed to provide (a) accurate and efficient scoring 
of tests, (b) item analysis to improve test construction, (c) individual re- 
ports to students and faculty, (d) references to areas to be reviewed, and 
(e) maintenance of grade registers for large classes and computation of 
final grades using computers (Miller, Doig & Milliken, 1967). One pro- 
gram computes the correlations between test items and total test scores, 
rejects all items below a specified correlation coefficient, and recomputes 
each student’s score based on the items retained. 
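
The item-rejection step described above might be sketched as follows. This is an illustration and not the Colorado State program; it assumes 0/1 item scores, correlates each item with the uncorrected total (the item itself is included in the total), and uses hypothetical function names and threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rescore(responses, min_r):
    """Drop items whose item-total correlation is below min_r, then rescore."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    kept = [j for j in range(k)
            if pearson([row[j] for row in responses], totals) >= min_r]
    new_scores = [sum(row[j] for j in kept) for row in responses]
    return kept, new_scores


# Five students, three items; the third item relates weakly to the total.
answers = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 0, 1], [0, 0, 0]]
kept_items, revised = rescore(answers, min_r=0.6)
```

With these data the three item-total correlations are roughly 0.88, 0.72 and 0.48, so only the first two items are retained.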

Altenburg, King and Campbell (1968) developed a program for grad- 
ing answers to questions on chemistry laboratory reports for which one or 
more previous answers are required as input data. The student’s first intermediate
answer is compared with what he should have obtained from his
data; the student’s second intermediate answer is compared with what he
should have correctly calculated from his reported first intermediate answer,
whether or not the latter is itself correct. This process continues with successive
intermediate answers through the final answer. Figerio (1967) re-


oped a program which allows more than one answer to each question and 
the assignment of partial credit to one or more answers. Rippey (1968) prepared
a program to score and analyze probabilistic tests which require students
to assign probabilities or weights of preference to each response.
Clawar (1967) outlined a program which gives for each pupil three types 
of observations reported in the form of histograms: (a) pupil vs. class 
median comparisons, (b) pupil vs. norm population median comparisons, 
and (c) intra-individual comparisons (differences among all possible pairs 
of subtests for a battery). Test scoring programs which offer the teacher a
flexible basis for assigning letter grades to a test or to an entire course were
reported by Maxwell (1967).
Additional References: Anderson (1967); Asprey (1965); Computers on 
Campus (1967); Cuttitta (1964); Flynt (1963); O'Malley and Stafford 
(1966); Zinn (1965). 
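
The chained grading scheme attributed to Altenburg, King and Campbell can be sketched as below. This is only an illustration of the idea: each answer is checked against what follows from the student's own previous value. The function name, the relative tolerance, and the two-step chemistry example are assumptions, not details from the original program.

```python
def grade_chain(data, reported, steps, rel_tol=0.01):
    """Mark each reported answer against the value implied by the prior one.

    data:     the student's raw measurement
    reported: the student's answers, in order
    steps:    one function per answer, computing it from the previous value
    """
    marks = []
    previous = data
    for answer, step in zip(reported, steps):
        expected = step(previous)
        marks.append(abs(answer - expected) <= rel_tol * abs(expected))
        # Carry the student's own value forward, whether or not it was right.
        previous = answer
    return marks


# Hypothetical report: grams of solute -> moles -> molarity.
steps = [
    lambda grams: grams / 40.0,    # moles, assuming molar mass 40 g/mol
    lambda moles: moles / 0.25,    # molarity, assuming 0.25 L of solution
]
# The student miscalculates moles (0.06 instead of 0.05) but then
# computes molarity correctly from that wrong value.
marks = grade_chain(2.0, [0.06, 0.24], steps)
```

Here the first answer is marked wrong and the second right, so the early error is penalized only once.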
Schmitt and Alterman (1968) noted that, although the computer
made the process of multiple regression analysis feasible, following comple- 
tion of the basic analysis, additional computations are required to produce 
useful data for working with individual students. Even the teacher or 
counselor with relatively well-developed computational skills does not have 
time to produce predictive data in usable form. The authors outlined a 
method for computer production of nomographs for bivariate prediction. 
The nomograph provides graphic representation of a bivariate regression 
equation and permits ready association of the predicting scores with the 
predicted score for any student. The nomograph is produced by a Calcomp 
Digital Plotter from instructions generated by a computer. Input to the 
computer for the generation of plotting instructions consists of the regres- 
sion equation coefficients and constant, the score values of the prediction 
lines desired, and the labels to be drawn on the nomograph. 
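
The inputs listed above suggest how plotting instructions could be derived. The sketch below assumes a chart with one predictor on each axis and one straight line per predicted score, which may differ from the layout Schmitt and Alterman actually used; the function name and coefficients are hypothetical.

```python
def prediction_line(constant, b1, b2, predicted, x1_values):
    """Points (x1, x2) at which constant + b1*x1 + b2*x2 equals `predicted`."""
    if b2 == 0:
        raise ValueError("b2 must be nonzero to solve for x2")
    return [(x1, (predicted - constant - b1 * x1) / b2) for x1 in x1_values]


# Hypothetical equation: predicted = 1.0 + 0.5*x1 + 0.25*x2.
# Endpoints of the line for a predicted score of 3.0 over x1 in [0, 4].
endpoints = prediction_line(1.0, 0.5, 0.25, 3.0, [0, 4])
```

A plotter would draw one such line for each predicted score requested, and a counselor would read off the line nearest the point given by a student's two predictor scores.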


Additional Reference: Pierce (1967). 
Cooperative Applications 


An excellent example of a cooperative computer-based test develop- 
ment system is provided by the COMBAT (COMputer Based Achievement 
Testing) System. This system was cooperatively initiated in 1967 by the 
Metropolitan Area Test Program Board (representing 67 school districts 
in the Portland, Oregon Metropolitan Area), Teaching Research (a divi- 
sion of the Oregon State System of Higher Education), and the Portland 
Public School District. COMBAT is essentially a test item pool, featuring 
a computer software system that provides convenient retrieval of items by 
teachers specifying the subject matter and grade level of items desired. 
It is projected that by July 1, 1970 the item pool will contain 41,000 items 
in such specific areas as social studies, science, language arts and mathematics.
The item pool contains five kinds of items: true-false, multiple
choice, matching, short answer and essay. The test items are constructed
through the joint efforts of classroom teachers and supervisory personnel of 
the districts. In using the system a teacher identifies by phone the specifi- 
cations for the test. The computer is then instructed to search the pool 
for all items meeting the requirements stated in the test request. Once all 
the appropriate items are identified, the computer randomly samples from
this sub-pool until the number of test items necessary to fill the test
request has been drawn. Typically, a printout of the items and a scoring key
are returned to the teacher for editing and approval. The approved items
are printed by the computer onto duplicating masters. 
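
The retrieval procedure just described (filter the pool by the teacher's specifications, then sample at random) can be sketched in a few lines. The record layout, field names and function name are assumptions for illustration; no COMBAT materials were consulted.

```python
import random


def build_test(pool, subject, grade, n_items, seed=None):
    """Randomly draw n_items from the pool items matching subject and grade."""
    sub_pool = [item for item in pool
                if item["subject"] == subject and item["grade"] == grade]
    if len(sub_pool) < n_items:
        raise ValueError("not enough matching items in the pool")
    rng = random.Random(seed)
    # sample() draws without replacement, so no item appears twice.
    return rng.sample(sub_pool, n_items)


# A tiny hypothetical pool.
pool = (
    [{"id": i, "subject": "science", "grade": 7} for i in range(5)]
    + [{"id": 100 + i, "subject": "science", "grade": 8} for i in range(3)]
    + [{"id": 200 + i, "subject": "mathematics", "grade": 7} for i in range(3)]
)
draft = build_test(pool, "science", 7, n_items=3, seed=1)
```

The draft would then go back to the teacher for editing and approval, as in the procedure described above.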

COMBAT is a cooperative venture in which the constituent districts by 
pooling their staff resources are able to mutually finance a program which 
provides achievement and ability surveys at appropriate grade levels, development
of local norms, measurements of individual class performance,
diagnosis of specific kinds of student learning problems, reporting of test 
results, and a wide variety of other statistical and interpretive services. 
One of the most important aspects of the COMBAT system is its poten- 
tiality for the development of tests to fit a specific local curriculum. Ulti- 
mately a teacher will be able to identify a specific instructional area at 
any grade level and have an appropriate test custom built to fit the 
specifications provided. The program is presently in the initial stages of 
working toward complete implementation of the system. As indicated by 
the “COMBAT” reference, no materials regarding the COMBAT system 
have been formally published, but some descriptive mimeographed ma- 


counting machine for the construction and reproduction of classroom tests 
utilizing an item pool. Stodola (1965) outlined an item pool procedure 
which assigns a number code based on a detailed content analysis of 
course objectives and on the statistical characteristics of the items which are 
punched on cards together with answer selections. Through use of the code 
system, tests are constructed by selecting from the item pool items possessing
the specific content and statistical characteristics desired. The punched
cards for the items selected are processed on an accounting machine to 
print duplicating masters. Woodson (1968) reported the use of a com- 
puter for random selection of items from an appropriate item pool and for 


constructed from an item pool with the same or a similar difficulty rating
to be given to different sections of the same course meeting at different 
hours, or even to give a different test or tests to students seated next to 


Schmitt et al. (1966a, 1966b) found the computer was essential to the 
cooperative development of a series of unit achievement tests. Item analyses 
were performed by the computer on the initial forms of each unit test in 
irical bases for item ordering and revision. This 


would be practically impossible if it had to be done by hand, but which 
was accomplished by computer in less than five minutes per test. 


Future Applications 


Psychometric Evaluations 
Helm (1967) outlined a project under way at Educational Testing Service
to have a computer produce written evaluations of psychometric test
materials. One result of this research was the development of a special 
computer language called PROTRAN (profile translator). A psychologist 
prepares a set of specifications indicating the kind of things that he wants 
the computer to say and the conditions that must be satisfied if they are 
to be said. It is possible to build up very complex hierarchies of interpre- 
tations in a natural way. In one application involving a complex set of 
about 100 specifications for a profile of 70 scores, the computer produced 
a paragraph of verbal interpretations in about 2 seconds. 
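
The general idea of pairing each statement with the condition under which it may be said can be sketched as follows. This is not PROTRAN syntax, which is not shown in the source; the scale names, cutoffs and wording are invented for illustration.

```python
def interpret(profile, specifications):
    """Join the statements whose conditions the score profile satisfies."""
    return " ".join(text for condition, text in specifications
                    if condition(profile))


# Each specification pairs a condition on the profile with a statement.
specifications = [
    (lambda p: p["verbal"] >= 60,
     "Verbal score is well above average."),
    (lambda p: p["verbal"] >= 60 and p["numerical"] < 40,
     "Verbal ability is markedly stronger than numerical ability."),
    (lambda p: p["numerical"] >= 60,
     "Numerical score is well above average."),
]

report = interpret({"verbal": 65, "numerical": 35}, specifications)
```

Conditions may refer to several scores at once, which is one simple way that layered interpretations of a profile can be built up.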


Finney (1967) described a program which is being developed to inter- 
pret a full battery of tests taken by an individual and compose a full report 
on the individual’s personality. The computer reports are expected not 
only to show external validity but also to be written so that it cannot be 
told by reading them whether they were mechanically composed or written 
spontaneously by a capable psychologist. At the stage of development 
reported, the program had been written to prepare reports on two well- 
known tests—the MMPI and the CPI. Some defects and even a few con- 
tradictions were found, but the overall results were promising. 


Test Branching 


Educational Testing Service (Cleary et al., 1968a, 1968b) is conducting 
feasibility studies on the use of programed tests in which a sequential 
system of branching is used to direct a subject to items that are appropriate 
to his performance level. The application of programed tests in practical 
situations would be facilitated by computer-based testing. Although this is 
presently an expensive testing procedure, improved computer hardware is 
making it increasingly more feasible. Computer-based testing has the 
advantage of far greater flexibility than usual testing procedures, and it 
may have the ability to provide immediate feedback to the examinee. 
Cleary et al. (1968a) noted that the use of computer-based testing could 
eliminate mass test administrations. Each person could be tested on an 
individual basis at his own convenience and speed at a testing console. 
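
A sequential branching rule of the kind described can be sketched as below. The specific rule (move up one difficulty level after a correct answer, down one after an error, starting in the middle) is an assumption chosen for simplicity and is not taken from the Cleary et al. studies.

```python
def administer(levels, respond, n_items):
    """Give up to n_items, branching on each response.

    levels:  item lists ordered from easiest to hardest
    respond: function taking an item, returning True if answered correctly
    """
    level = len(levels) // 2          # start at the middle difficulty
    used = [0] * len(levels)          # items already given at each level
    record = []
    for _ in range(n_items):
        if used[level] >= len(levels[level]):
            break                     # no unused item left at this level
        item = levels[level][used[level]]
        used[level] += 1
        correct = respond(item)
        record.append((item, correct))
        if correct:
            level = min(level + 1, len(levels) - 1)
        else:
            level = max(level - 1, 0)
    return record


# Hypothetical examinee who fails only the hardest items.
levels = [["e1", "e2"], ["m1", "m2"], ["h1", "h2"]]
record = administer(levels, lambda item: not item.startswith("h"), n_items=5)
```

The examinee above oscillates between the middle and hard levels, which is the behavior expected when his ability lies between them.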


Kleinmuntz and McLean (1968) are seeking to develop a method to 
permit a diagnostic interview to be conducted by computer. They propose 
that a subset of items from the MMPI clinical scales could be given to an 
individual at a computer console. On the basis of the individual’s responses 
to these items, certain hypotheses would be formulated by the computer 
program regarding the specific psychiatric dimensions worthy of further 
exploration. In practice, this would mean that instead of administering all 
550 MMPI items to an individual, after the program determines from the 
initial set of items that only the dimensions of schizophrenia and paranoia, 
for example, are relevant, the computer would be programed to branch 
to highly discriminating schizophrenia and paranoia items borrowed
from a number of well-established and empirically validated personality 
tests. A study to obtain an approximation of the goodness of fit between 
an MMPI administered conventionally and by the computer-controlled 
branching system was reported. Although the goodness of fit between the 
computer-administered and the conventional long form of the MMPI was 
not encouraging, further work was planned. 


Additional References: Bare (1966); Caffrey (1967); Janssen (1966). 


Simulating Test Norms and Tests 


Fontes (1968) reported the development of a model for the production 
of simulated achievement test norms. The regression of item response upon 
a standard measure of scholastic aptitude is employed as the basis for a 
Monte Carlo procedure which utilizes known ability characteristics of a 
population of interest to simulate individual (and, cumulatively, group) 
performance on a newly constructed test. A trial of the model is pre- 
sented, using total score on the School and College Ability Test (SCAT) 
as the ability index and selected items from five unit tests in high school 
biology to constitute the novel test. Different samples are used for develop- 
ing the input statistics for the simulation and for testing the adequacy of 
the pseudonorms produced. 
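
A Monte Carlo procedure of this general kind can be sketched as follows. The sketch assumes a linear regression of the probability of a correct response on the aptitude score, clipped to the interval from 0 to 1; the form of regression Fontes used is not stated in this review, and all names and parameters here are hypothetical.

```python
import random


def simulate_totals(aptitudes, item_params, seed=None):
    """Simulate a total score on the new test for each aptitude score.

    item_params: (intercept, slope) per item for the assumed regression
                 of P(correct) on aptitude.
    """
    rng = random.Random(seed)
    totals = []
    for x in aptitudes:
        score = 0
        for intercept, slope in item_params:
            p = min(1.0, max(0.0, intercept + slope * x))
            if rng.random() < p:      # simulated correct response
                score += 1
        totals.append(score)
    return totals


def percentile_rank(totals, score):
    """Percent of simulated scores below `score`, counting half of the ties."""
    below = sum(t < score for t in totals)
    equal = sum(t == score for t in totals)
    return 100.0 * (below + 0.5 * equal) / len(totals)


# Hypothetical population and a three-item test.
aptitudes = [40, 45, 50, 55, 60, 65, 70]
items = [(-0.2, 0.015), (-0.5, 0.02), (0.1, 0.01)]
pseudo_totals = simulate_totals(aptitudes, items, seed=0)
```

Percentile ranks computed from the simulated totals play the role of the pseudonorms, which would then be checked against the scores of a separate sample.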

Richards (1967) described a computer simulation procedure developed 
for writing verbal comprehension items. The procedure was used to con- 
struct a 72-item test which, together with the Wide Range Vocabulary Test, 
was administered to a group of college freshmen. The test intercorrelations, 
reliabilities and correlations with grades suggest that, in principle, com- 
puters can write tests. 


Computer Analysis of Essays 


One of the most significant developments for the analysis of prose 
has been the General Inquirer, a system of computer programs for content 
analysis of English text (Stone et al., 1966). Bhushan and Ginther (1968)
reported using the General Inquirer program to analyze essays in a study
which attempted to discriminate between a good and a poor essay. The 
program was used to take into consideration sentence structure and length, 
vocabulary, and sociological and psychological constructs in the text. 
Additional References: Carlson (1967); Dunphy, Stone and Smith (1965); 
Holsti (1964). 


Exploratory research is currently being conducted by Page (1968) to 
define a set of parameters that can be measured by computer and that will
predict English grades given to essays written in a standardized situation. 
Page (1966) pointed out that psychometricians need a way to measure
essay quality with the same reliability, validity and generalizability (with
the same “objectivity”) which they enjoy with multiple-choice items. He
believed that computers had the potentiality to perform a stylistic and
subject-matter analysis of student essays according to the general
desired and to deliver to the teacher extensive comments and suggestions
for the student by the first bell the next day. Page (1968) developed a program
which analyzes the following traits in an essay: (a) ideas or content, 
(b) organization, (c) style, (d) mechanics, and (e) creativity. The inter- 
correlations obtained between the computer corrected essays utilizing this 
program and the same essays rated by human judges were equal to those 
obtained between the human judges. These results led Page to believe
that within a very few years computer analysis of prose will be used 
routinely in large examinations and, even beyond that, in ordinary class- 
rooms. 

One of the biggest problems in implementing computer analysis of 
essays is obtaining an economically feasible computer input, since secre- 
taries cannot be expected to copy essays into keypunch machines. The 
input difficulty should be solved by on-line input to a time-sharing system 
or optical scanning machine transformations of typewritten or handwritten 
pages. The Chicago Board of Education was one of the first school systems 
to use an optical scanner which converts typewritten and other printed 
materials into coded input to the computer (Goodlad, O’Toole & Tyler, 
1966). New page readers can read typed copy directly and produce a 
computer tape record of the typed verbal text. Handwritten character and 
numeric readers are now being commercially produced; however, their 
usage is still dependent on the quality of the text preparation. Numeric
readers tested at Tufts University by the Institute for Psychological
Research have been developed that can analyze 100,000 different numbers
written in a variety of ways with an accuracy of 98.5% (Bushnell, 1964).


Conclusion 


In conclusion, the vision of Cooley (1964) for the future application 
of computer technology to school testing programs seems appropriate. 
Cooley holds that if, in the application of the computer to test scoring, 
no attempt is made to ask questions which would assist in interpreting the 
data, the need for the computer is dubious. The computer’s capacity to 
correlate, compare, interrelate and synthesize data is almost unlimited. 
Cooley visualizes a guidance-measurement system in which the potentiality 
of the computer to examine many test scores simultaneously through 
multivariate analysis will be utilized. School testing programs will shift 
from a system of recording sets of numbers on student cumulative records 
to a dynamic procedure of “flashing red lights” which indicate when
certain students seem to be in icular of danger. Rather than 
assessing how much the student as have tended to do in the 
past, school testing programs will focus on a procedure which identifies 
what missing skills or concepts are interfering with a student's school 
progress. 

Additional References: Cooley and Lohnes (1966); Impellitteri (1967); 
Whitlock (1965). 


away from infancy is provided by the implementation of the COMBAT 
system and the future applications cited above which are presently underway.
THE VALUES OF THE ACADEMY

(MORAL ISSUES FOR AMERICAN EDUCATION 
AND EDUCATIONAL RESEARCH 

ARISING FROM THE JENSEN CASE) 


Michael Scriven 
University of California at Berkeley 


The “Jensen affair” is a case-study of the utmost importance 
to American education. Only if the lessons to be learned from it are heeded 
can the academies avoid the justified condemnation, damage and possible 
destruction that other institutions of this society have suffered and will 
suffer. The universities have already been under fire, but mainly as social 
institutions, for sharing the prejudices of society. The new threat is to 
their essence as the guardians and repositories of scholarship, the paragons 
of learning, the teachers of teachers of teachers. 


The Problems 


Some of the questions raised, often stated in their exceedingly prejudicial,
naturally occurring form, include the following:

1. Should any researcher publish results (or, in general, do anything) 
that will probably be used to further unjust causes? 

2. Can any studies done in an extremely racist society demonstrate 
that the genetic component explains most of the performance deficit 
shown by the oppressed group? 

3. (a) Can any journal justify publishing such results? (b) Under
what circumstances can a journal restrict the topics to be discussed
by an invited contributor? (c) What are the legitimate measures
a journal can take to compensate for such publication or attendant
publicity, if any?

y , fi le, commissioned this article but refused to publish it; refused to
sell iis eae the Jensen article unless the purchaser bought the “rebuttal”
issue (but not the counter-rebuttal), etc.

4. Should a university hire or retain researchers who publish “racist”
research?


5. (a) Should academics engage in or condemn dissemination tactics
when asked to publish or review colleagues’ work in scholarly
journals? (b) Should they discuss in their publications the probable
moral consequences of their publications?

6. Should researchers select their areas of work with careful
consideration of the social consequences of probable outcomes?


7. Is this kind of research result self-fulfilling? 

8. Does IQ measure intelligence well? 

9. Are value judgments legitimate in science? 

10. What are the social and moral implications of intelligence dif- 
ferences and IQ differences? 

11. What is racism? 


The Situation 


The present situation, which will probably be different (perhaps 
shockingly so) by the time of publication, is roughly as follows. Jensen 
has been savagely attacked for being a racist; he has been attacked in 
print and in speech by the radical left at Berkeley. His dismissal has been 
incorporated into the short list of demands for campus reform by the 
Students for a Democratic Society. He has also been savagely or 
attacked in print or speech by many of his Berkeley colleagues. A 
letter, signed by someone who put “PhD” after his name, mentions one 
article which allegedly refutes Jensen’s whole study and which was by 
implication not considered for that reason. Another letter, signed by several
extremely distinguished faculty members, casually dismissed Jensen’s
conclusions as essentially ignorant; it now appears that several and probably
most of the signers had not read the article. The Teaching Assistants’ 
Union at Berkeley stated that Jensen should not be fired “despite his 
racism,” but much anger was vented on someone who suggested that the 
quoted phrase be dropped as unproven. The reference to Jensen was inci- 
dental support for a motion condemning the proposed dismissal of an 
instructor who went on strike the previous quarter; it was used to suggest 
inconsistency in the administration’s attitudes. (Some letters in support of 
Jensen have also been published.) The attempts in leading articles to 
The above news notes are not guaranteed to be one hundred percent
scholarly, but I believe they convey the Berkeley scene correctly. The
national media have presented a picture that is distorted in exactly the
same direction, often with different motives. That is, they have
to a greater or less extent taken Jensen to be claiming a more or less
universal difference between black and white intelligence, which more or
less justifies economic and educational discrimination. The qualified
professionals have reactions of which a reasonable sample appears in the
“rebuttal” issue of the Harvard Educational Review (Spring,


than is implicit in the above descriptions, I want to suggest some aspects
of it that I think are far more important for discussion than is 
one man’s opinions of this complex issue. 


The Crucial Educational Significance of the Case 


It is easy for academics to view with alarm the sight of the left at
Berkeley joining in the attack on faculty, an attack in which the right
has long been the only participant. It is easy to show that much of the
criticism by the left in this case is ill-formed, illogical, and politically
motivated. It is easy for many academics to feel gratified
in their suspicions of the radical left as constituting a grave threat to
academic freedom (Sidney Hook, for example).

It is not so easy to admit that this condemnation 's criticism 
srs the elpelty ofthe academy and the allegations of the rie 


put forward such bad arguments, (b) advocate and engage in such in-
tolerant action, (c) succeed in enlisting support from thousands of their 
fellow-faculty and fellow-students? The third point is the most important 
and the conspiracy explanations favored by Hook and others are simply 
inadequate to handle it. The alternative explanation, which is necessary 
to handle (c) and is in my view a better explanation of (a) and (b), 
is simply that the educational system has never given these students the 
skills to analyze such issues, the desire to do so, or any good reasons for
believing that the liberal academy’s commitment to free inquiry and 
academic freedom is more than another shibboleth of a corrupt society. 
This statement does not imply that the radicals have no responsibility 
for the errors of their rhetoric. It does mean that they have been given 
no reason to believe, or no rational training to recognize, that quibbling 
about errors in rhetoric is more than an academic hang-up in a society 
which needs revolution. The academy has not even taught its own values;
at most it has mouthed them and more often ignored them. 

The academy has been described as a collection of professional scholar- 
entrepreneurs coasting along as amateur instructors. They not only know 
nothing about teaching, it is said, but they view it as insulting to suggest 
that they should and absurd to suggest that they do not. With respect to 
teaching, which is their own primary obligation to society, they reject the 
idea of scholarly evaluation. I consider that a truly appalling indictment, 
and I believe the major importance of the Jensen affair is to demonstrate
its truth. It is trivial to try to settle the substantive issues in the case 
without first acknowledging its shocking significance. Whether Jensen is 
right or wrong and whether his critics are rational or irrational is a slight 
matter in the long run compared to what the dialog has revealed about 
the incompetence in the crucial areas of our students and their teachers. 
For a long time it has been clear that the irrational reactionary attacks 
on the university and the school are indirectly an indictment of us since 
it is our graduates that attack us in a way that demonstrates that they 
never heard Mill’s arguments or understood Marx’s. But their views, both 
particular and general, are intrinsically implausible, and their connection 
with the educational system is only in the distant past. The impact of 
the criticism from the new left is more upsetting because those students 
are here, now, in our classes or are recent graduates of them, and because 
their most fundamental criticisms of the university as a preserve of privilege 
rather than a place for students, as an essentially conservative force in the 
society, are supported whether they are right or wrong in this particular 
case. 


A Possible Defense and Rebuttal 


Is there really anything known about teaching that has been ignored 
by the amateurs in the academy? If not, they are hardly to blame for 
failing to instill reason in these students and faculty. 

Our knowledge is at three levels, and it has been ignored at each 
level. (It is also very limited and that fact is itself part of the indictment, 
perhaps the most important part. The feeling in “subject matter”
departments has always been that research on any conceivable topic other
than teaching is respectable, except that if you cannot get into the regular
psychology department then you may finish up in educational psychology
investigating the optimal location of pencil sharpeners in the classroom:
a standard academic joke and a symptom of sick values in those that do
not go beyond it.)

We have some commonsense knowledge about teaching; we have some 
moral knowledge about it; and we have some particular research results of 
value. I mention only some illustrative points. We require an examination
in a foreign language in the name of liberal education, but we do not
require an examination of liberalism, education or examinations. We re-
quire statistics for researchers but never make them consider the social
contribution of their proposed work. We are teaching citizens for a
democracy but we do not teach them to define, discuss, defend or detract from
democracy. We send them to war but never discuss pacifism. We grade
them in every class but do not defend the grading procedures. We teach 
them “logic” but not how to analyze arguments; “ethics” but not morality; 
“communism” but never by communists. Each of these examples violates 
one or several standards of sense or sensibility. In short, we reap exactly 
what we sow and then show shock! 


The Two Non-Cultures 


The values of the academy are as shoddy as everyone else's, though 
no doubt in slightly different ways. Lord Snow thought the gap between 
the two cultures was a reflection on the educational system, which is true 
enough, but he was exposing a gap in an ancient skeleton. The putrefying 
body is far less attractive. The problem is not two cultures but two 
non-cultures. Take a scientist out of his home lab and ask him to apply
the scientific method to a new area, say behavioral genetics, speed reading,
hypnosis, psychoanalysis, or parapsychology and—instant irrationalism. 
(The AAAS refused three times without giving reasons to grant affiliation 
to the Parapsychology Association.) Take a humanist away from his 
library and ask him to discuss the dehumanization of US soldiers in training 
or in Viet Nam, of police or people in cities, of students and faculty in 
multiversities and—instant disclaimers of competence. So it is a narrow- 
minded clergyman who is appointed to draw up the new compulsory 
moral education curriculum in California and his committee adopts the 


teaching demonstrated to be totally ineffective. But how many graduates 
of the University of California know enough about that study to vote down 
the curriculum? How many of the faculty know about it? It was a scientific 
study of the traditional ways of teaching humans to respect the rights of 
humans, and because of the academy's failure to teach what matters, we 
are now set up to repeat the mistakes in education yet again. 

Now look back at the eleven questions in the second section above. 
How many were systematically discussed in your professional education? 
Most of them are crucial for every academic professional, especially in the 
educational field. But the academy treats professional ethics as it treats 
the only common professional skill, teaching, something beneath the 
academy’s dignity to discuss. In the Jensen case we are reaping only the 
first fruits of that neglect. Worse will surely come. 
Read Jensen’s article and then ask, “Is its topic socially and politically
important?” The answer is “yes.” Is it morally important? Again, “yes.”
Is it intellectually important to the sciences as well as to the humanities? 
“Yes.” Then why is the level of discussion of it so appalling? It is so only 
because the students, like most faculty, are not taught the skills, the data 
or the attitudes necessary for handling and acting on controversial, moral- 
political-scientific issues. The very rudiments of scientific method require 
an understanding of control-group methodology, of the significance of 
twin-studies, of the difference between population statistics and sample
inferences—not hours of fiddling with quantitative balances in chem. lab. 
There are several thousand US citizens dying every year because nobody 
thought their lives would depend on understanding carcinogen statistics— 
on understanding by them, as well as by scientists and by politicians— 
understanding that is deep enough to motivate. The primer of political 
and moral reasoning involves understanding that the worth of people and 
their rights do not depend on IQ. But since we have stuffed our students 
with an absurd value-free ideology in the social sciences (Scriven, 1966), 
they have had to construct their own anti-war movement and their own 
standards of the worth of research from that barren ground. It is no 
surprise that their own standards are not those of the academy. Several 
thousand of them will be dying in the jungles this year for a cause no one 
can prove just; it is not too surprising if they do not all embrace the values 
of a society which has never bothered to discuss those values seriously or 
to implement them carefully. Their neighbors and families will live in 
deteriorating conditions at home so billions of dollars can be spent for a 
defense system that virtually all experts not paid by the Pentagon regard 
as absurd or so a man instead of a machine can be landed on the moon. 
In the presence of this background, in the context of the noniversity, and 
in the absence of the appropriate technical education, one could hardly
expect a friendly rational chat about the distribution of deviance as a
reaction to Jensen.


The Intrinsic Merits of the Case 


Consider the printed issue itself, Jensen’s article and its commentators. 
Almost all the political critics make errors so gross that intrinsic criticism 
is otiose. Four points are possibly worth mentioning. First, many of the 
radicals argue that it is wrong to publish, and perhaps even to get, results 
which can be misinterpreted or misused, e.g., in the cause of racism. It 
would follow that their protests should be silenced since such protests will 
certainly be used to discredit the movement: their argument is self-refuting. 
It is also bad since the short-term abuse of these data may well be offset 
by long-term gains in allocating compensatory education funds more 
appropriately and achieving a gain in performance. 
Second, it is hard not to suspect that it is the radicals who are hung-up
on the idea that intelligence is an important measure of real social worth.
Why do they react so violently to an alleged difference? If it exists, would
it justify condescension? After all, even the existing IQ difference between
black and white, much of it environmental, is not very exciting by com- 


ments can exceed the inter-racial difference in means—so the alleged genetic 

giving up on attempts to revolutionize the 
environment. That it may be (wrongly) interpreted to do so is no more 
grounds for attacking the truth of the claim than for attacking the claim 
that genetic differences are crucially responsible for the low frequency of 
interracial marriage. At first sight, you think “In some cultures (environ- 
ments) it might even be a mark of great merit to marry cross-color so it is 
only a feature of our society that it opposes intermarriage.” But of course 
white racist prejudice against miscegenation absolutely depends on the 
black-white skin difference, which is inherited. Other prejudices are not 
genetically dependent. The anti-Jensen radicals really have not seen 
through the usual mythology of the acquired/inherited distinction. They
should certainly attack social institutions that unreasonably reward high 
IQ, though there are fewer than they think. 

Third, the concept of racism involved in the charges is peculiarly 
ill-defined. Racism originally meant the conscious or unconscious influence
of racial characteristics on attitudes, actions, or arrangements to which 
these characteristics are actually irrelevant. In the past year or so the
charge of racism in the universities has been supported by evidence to 
which it is, in this sense, irrelevant, and charity suggests we introduce 
another sense of the term. In this sense, which might perhaps be called 
“passive racism,” one is racist if he has not done all he might reasonably 
the effects of racism by mhap Sn a not the 
slightest evidence in his article that Jensen is a racist in either sense,
although on general grounds the whole of mankind probably is laissez-faire
racist including most radicals. Jensen can be identified as a racist only 
in the sense that his work can be (mis-)used by racists to support their 
case. If that kind of redefinition is allowed, then the revolutionaries are 


Fourth, the biological definition of genetic differences is not very sig- 
nificant socially. It might well be that black Americans are now genetically 
superior in physical strength or anything else because of the selection 
pressures in the plantations. The continued effect of environmental factors 
leads to a genetic difference, but the genetic difference is no less the result 
of social factors than is the accent of a Southerner; there is simply a
slightly longer time-scale involved. The selection of blacks by slavers in 
Africa may well have been connected with standards of strength or physical 
appearance or sociability—or intelligence. Thus, the original gene pool of 
the US black may have been atypical of the African population. Even 
this fact might well be seen as a social effect. A genetic difference now 
is surely not a crucial issue for social policy; it is just a snapshot of the
passing process. The social question is simply how to make the process 
into progress—whatever its present status. The discovery that some children 
do not learn to enjoy or play music from the usual methods of teaching 
is simply a challenge to find new ways to teach them. 

The heart of the (academic) matter is this: do the data support the 
claim of interracial genetic differences? Hunt (1969) thinks not, but he 
believes this on general inductive grounds which appeal to external data 
—he is not pointing out a fallacy in Jensen’s statistical analysis or in his 
knowledge of the obviously relevant data. Moreover, Jensen does appeal 
to one extremely relevant kind of external data—the views of evolutionists. 
It is hard not to conclude that the main difference between Jensen and 
Hunt on the basic point is a legitimate difference of judgment at this stage 
of the game. I think there is no doubt that Hunt and Cronbach (1969) 
put flesh on the bare bones of the null hypothesis and we must agree that 
there may, even now, be ways to get those population means to match 
by selective education. (It is also obvious that selective mating could do it 
and may have undone it.) But should anyone care? Trying to fulfill the 
potentialities of each student is society’s obligation, not equality of achieve- 
ment along one problem-solving achievement scale. Jensen keeps his aim 
rather steadily on the right goals. Whether his fairly simple division of 
capacities is the key or whether his reading of the evidence is always 
impeccable, it is a social task of considerable importance to consider what 
goals and what techniques are most appropriate for compensatory education 
now and thus indeed for all education. 

It would be a serious error to suppose that discovering high heritability 
for “g” tells either that highly valuable gains cannot be made in this
generation or that very large gains cannot be made in the normal time it 
takes to replace the present school faculties with people trained in a new 
and more appropriate approach, i.e., two to three generations. If at times 
Jensen appears less sanguine about this than a strict view of statistical 
inference allows, it is certain that the best trained environmentalists writing 
about the subject have gone as far and often much further in the opposite 
direction: we are well within the range of professional judgment. In general, 
it is worth remembering that the standard of criticism here is not whether 
fifty critics can pick some holes in a hundred-page survey article covering 
several hundred references (each critic knows this would happen to him
if he was the author) but whether the criticism destroys

of the article, The attendant furor in the Jensen ee eek ot 
attention on one point in it; but even on that one point, and after all the 
criticism, Jensen’s case is, I believe, not only well within the boundaries of 
legitimate professional interpretation, but an extremely important position 


to consider. 
ATTITUDES TOWARD MATHEMATICS


Lewis R. Aiken, Jr. 
Guilford College 


In her “Review of Research of Psychological Prob- 
lems in Mathematics Education” Feierabend (1960) devoted about ten 
pages to research on attitudes toward mathematics.¹ During the past decade
a number of published reports of conference proceedings have been con- 
cerned with mathematics learning (e.g., Hooten, 1967; Morrisett & Vinson- 
haler, 1965), but these reports do not treat research on attitudes in detail. 
Because the number of dissertations and published articles dealing with 
attitudes toward mathematics has increased geometrically since Feierabend’s 
(1960) report, it is time for a reappraisal of the topic. 


Wilson (1961) maintained that before “progressive education” came 
on the scene, more school failures were caused by arithmetic than by any 
other subject. Although the number of failures in arithmetic and mathe- 
matics may have decreased somewhat in the past fifty years, it is debatable 
whether modern curricula have fostered more positive attitudes toward the 
subjects. One must question how general these negative attitudes are, what 
causes them, and what can be done to make them positive. A committee 
formed some years ago (Dyer, Kalin & Lord, 1956) to study these questions 
concluded that more information was needed for adequate answers—infor- 
mation about biological inheritance and home background of the pupil; 
attitudes and training of teachers; and the content, organization, goals, 
and adaptability of the curriculum. A fair question is: “What information 
on the influences of these three types of factors has research provided since 
then?” The purpose of this review is to answer that question as it pertains 
to research conducted during the past ten years. Feierabend’s (1960) review 
should be consulted for a summary of earlier investigations. 

The interpretation of results depends to some degree on the types of 
measuring instruments or techniques employed in the research. Therefore, 
the review will deal first with paper-and-pencil, observational, and other 
methods of measuring attitudes toward mathematics which are described 
in the recent literature. Next, studies pertaining to the distribution and 
stability of attitudes and the effects of attitudes on achievement in mathe- 


$ i definition of the term attitude, in general it refers 
Alona A ogres on the part of an individual “to respond positively 
matics will be considered. Then, in response to the challenge of Dyer,
Kalin and Lord (1956) referred to above, findings about the influences 
of the home environment, the personality characteristics of the student, 
the teacher, and the school curriculum on student attitudes toward mathe- 
matics will be summarized. Next, research and discussions of techniques 
for developing positive attitudes and modifying negative attitudes will be 
reviewed. In the final section of the paper, the investigations which have 
been reviewed will be evaluated and some suggestions for further research 
will be offered. 


Methods of Measuring Attitudes Toward Mathematics 


Although it has been said (Morrisett & Vinsonhaler, 1965, p. 133) 
that there are actually no valid measures of attitudes toward mathematics, 
the fact remains that a number of techniques—some of them quite ingenious 
—are available. Several of these techniques were described by Corcoran 
and Gibb (1961): (1) observational methods; (2) interviews; (3) self-report 
methods such as questionnaires, attitude scales, sentence completion, 
projective techniques, and content analysis of essays. Although the majority 
of investigations have dealt with attitudes toward mathematics in general, 


pp ent specific courses or types of mathematics problems can also 


Observation and Interview 


Observed behavior would seem to be an important indicator of
attitudes, but Brown and Abell (1965) found observations made by
teachers to be inadequate for appraising their students’ attitudes
toward mathematics. Ellingson (1962), however, found a significant positive
correlation between the inventoried attitudes of 755 junior and
senior high school students and teachers’ ratings of the students’ attitudes
toward mathematics.


Another fairly obvious way to assess attitudes is to ask the pupil how


pies oe are Shapiro (1962) used this method in a semi- 

toward arithmetic 5 ton Mined at determining the feelings 
i ys an irls. Th ils’ iud 

determined by ratings of the 90 eni made iy ies ees. E 


Questionnaire Items 
Dreger and Aiken (1957, p. 346) administered the following three
questionnaire items to a group of college students to determine their feelings
toward mathematics. The students responded to each item with true or
false.




“I am often nervous when I have to do arithmetic. Many times 
when I see a math problem I just ‘freeze up.’ I was never as good 
in math as in other subjects.” 


Kane (1968) constructed another sort of questionnaire to measure 
attitudes toward mathematics and other school subjects. The college- 
student examinees were instructed to indicate which of four subjects— 
English, mathematics, science, and social studies—they (1) most enjoyed 
and found most worthwhile in high school, (2) most enjoyed in college, 
(3) learned the most about in college courses, (4) would probably most 
enjoy teaching, and (5) were probably most competent to teach. Their 
attitudes toward mathematics were indicated by the extent to which mathe- 
matics was preferred or selected over the other three subjects. Other 
examples of non-scaled questionnaire items and more formal questionnaires 
such as those devised by the semantic differential technique (see Anttonen, 
1968) could be supplied, but a more popular instrument for measuring 
attitudes is the attitude scale. 


Attitude Scales 


There are several attitude-scaling procedures; a few of them are
described briefly in the following paragraphs. Readers who desire further
information on techniques of attitude-scaling are encouraged to consult the
book by Edwards (1957). In Thurstone’s method of equal-appearing
intervals, each of a series of statements reflecting different degrees of
negative and positive attitudes toward something is given a scale value:
the median of the scale values assigned to it by a group of judges. A
respondent’s score on a scale consisting of a series of such statements is
the sum or mean of the scale values of the statements which he endorses.
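
The Thurstone scoring rule just described can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the three statements, the five judges' ratings on an 11-point scale, and the function names scale_values and thurstone_score are all invented for illustration and are not taken from Dutton's scale or any other published instrument.

```python
from statistics import mean, median

# Hypothetical ratings by five judges on an 11-point favorableness scale
# (1 = extremely unfavorable, 11 = extremely favorable). Invented data.
JUDGE_RATINGS = {
    "I avoid arithmetic whenever I can.": [2, 1, 3, 2, 2],
    "Arithmetic is about as interesting as any other subject.": [6, 5, 6, 7, 6],
    "I enjoy working arithmetic problems.": [10, 9, 10, 11, 10],
}


def scale_values(judge_ratings):
    """Give each statement the median of the judges' ratings as its scale value."""
    return {statement: median(ratings) for statement, ratings in judge_ratings.items()}


def thurstone_score(endorsed, values):
    """Score a respondent as the mean scale value of the statements endorsed."""
    if not endorsed:
        raise ValueError("respondent endorsed no statements")
    return mean(values[statement] for statement in endorsed)


values = scale_values(JUDGE_RATINGS)
respondent = [
    "Arithmetic is about as interesting as any other subject.",
    "I enjoy working arithmetic problems.",
]
print(thurstone_score(respondent, values))
```

The mean is used here rather than the sum so that respondents who endorse different numbers of statements remain comparable; the text allows either.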


In Likert’s method of summated ratings, the respondent indicates
whether he strongly agrees, agrees, is undecided, disagrees, or strongly
disagrees with each of 20 or so statements expressing positive and negative
attitudes toward something. His score on the scale is the sum of the weights
(successive integers such as 1, 2, 3, 4, and 5) which are assigned
to the particular responses which he makes. On both Thurstone and
Likert scales, high scores indicate a more favorable attitude toward the
particular topic of the scale.
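
A summated-ratings score of the kind described above might be computed as in the following Python sketch. The item labels, the responses, and the function name likert_score are hypothetical. The reversal of weights for negatively worded statements is an assumption made here so that high totals indicate a favorable attitude; the text does not spell out how negative statements are keyed.

```python
# Weights for the five response categories (successive integers 1 to 5).
WEIGHTS = {
    "strongly disagree": 1,
    "disagree": 2,
    "undecided": 3,
    "agree": 4,
    "strongly agree": 5,
}


def likert_score(responses, negative_items=frozenset()):
    """Sum the response weights, reversing them for negatively worded items.

    responses: dict mapping an item label to one of the WEIGHTS keys.
    negative_items: labels of statements that express a negative attitude
    (assumed to be reverse-keyed so that a high total is favorable).
    """
    total = 0
    for item, answer in responses.items():
        weight = WEIGHTS[answer]
        if item in negative_items:
            weight = 6 - weight  # 1 <-> 5, 2 <-> 4, 3 stays 3
        total += weight
    return total


# Hypothetical three-item example; item "q2" is negatively worded.
answers = {"q1": "strongly agree", "q2": "disagree", "q3": "undecided"}
print(likert_score(answers, negative_items={"q2"}))
```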

The Thurstone and Likert attitude-scaling techniques are popular
procedures for measuring attitudes toward mathematics; a third technique
for scaling attitudes, Guttman’s scalogram analysis, has been used less
frequently. This is probably because the Guttman scaling procedure requires
that the items to be scaled be on a single dimension, so that if a
respondent endorses one item he will endorse all items having a lower
scale value. Such a restriction is more likely to be satisfied for cognitive
test items than for affective items like attitude statements.
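
The cumulative requirement can be made concrete with a short sketch. The response patterns below are hypothetical, and the error count (comparing each pattern with the ideal pattern that has the same number of endorsements) is one common convention assumed here for illustration, not a procedure described in the text.

```python
def is_cumulative(pattern):
    """True if every endorsed item is preceded only by endorsed items.

    `pattern` is a list of 0/1 endorsements with items ordered from the
    lowest to the highest scale value.
    """
    seen_zero = False
    for endorsed in pattern:
        if endorsed == 0:
            seen_zero = True
        elif seen_zero:
            return False  # endorsed an item after rejecting a lower one
    return True


def reproducibility(patterns):
    """Proportion of responses that match the ideal cumulative patterns."""
    errors = 0
    cells = 0
    for pattern in patterns:
        k = sum(pattern)
        ideal = [1] * k + [0] * (len(pattern) - k)
        errors += sum(a != b for a, b in zip(pattern, ideal))
        cells += len(pattern)
    return 1 - errors / cells


# Hypothetical responses of four pupils to four ordered statements.
patterns = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
flags = [is_cumulative(p) for p in patterns]
rep = reproducibility(patterns)
```

A value of `rep` near 1 suggests that the items behave as a single dimension; attitude statements often fall short of this, which is consistent with the point made above.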

Example of a Thurstone Scale. The scale of attitudes toward arithmetic 
which has probably been used more than any other is Dutton’s scale 
(Dutton, 1951, 1962). This 15-item scale is given in Dutton’s 1962 paper; 
it consists of a variety of statements expressing positive and negative 
attitudes toward arithmetic. It was originally constructed to measure the 
attitudes of prospective elementary-school teachers, but it has also been 
administered in junior high school (Dutton, 1968) and even as early as 
the third grade (Fedon, 1958). In Fedon’s (1958) study, the children 
indicated the intensity of their attitudes with a color scheme which varied 
from red for an extreme positive attitude through yellow to convey a
neutral attitude to black for an extreme negative attitude. Dutton’s scale, 


and twelfth graders, a rather wide range for any psychometric device.

Examples of Likert Scales. Many researchers have preferred Likert
scales because they are usually easier to construct than Thurstone or 
Guttman scales. In their book on attitude scales, Shaw and Wright (1967) 
included two Likert scales for the measurement of attitudes toward mathe- 
matics—a 12-item, modified Likert scale by Gladstone, Deal and Drevdahl 
(1960) and a Revised Math Attitude Scale by Aiken (1963). The original
Math Attitude Scale appeared in an article by Aiken and Dreger (1961).
Alpert, Stellwagon, and Becker (1963) described the Likert-type attitude
scales used in the National Longitudinal Study of Mathematical Abilities
(NLSMA) of the Stanford-based School Mathematics Study Group (SMSG). In
an analysis of the NLSMA data, forty attitude items were broken down into
a number of subscales, for example: “pro-math composite,” “math
self-concept,” and “debilitating anxiety.” Brock (1967) devised a 35-item
mathematics attitude scale based on the six attitudinal levels of the
Taxonomy of [...] reworded the most discriminating [...] type scale and
formed them into a Likert-type scale.


Other Measures of Attitude 


Nealeigh (1967) experimented with a projective measure of pupil attitudes
toward mathematics. The pupil was shown 310 pairs of pictures (one member
of each pair contained a math concept) and told to indicate which of the 
two pictures he preferred. The “math concepts” included in the pictures 
were those of symmetry, similarity, order and pattern. Attitude and 
achievement in mathematics were assessed in another way and compared 
with pupil responses to the picture preference test. Although certain 
pictures discriminated between pupils with positive and negative attitudes 
and between pupils with high and low achievement, the pictures that were 
most discriminating with third graders were not necessarily the same ones 
that were most discriminating with seventh graders. 


An attitude is commonly considered to be partly cognitive and partly
affective or emotional. Therefore, it would seem that information about 
attitude, or at least about its emotional component, could be obtained by 
measuring autonomic responses to selected stimuli. Such measurements 
are too cumbersome for mass assessment of attitudes, but they have been 
employed in research. Dreger and Aiken (1957) measured changes in 
electrical skin resistance (GSR) in 40 college students while the Verbal 
scale of the Wechsler-Bellevue Intelligence Scale (WB-I) was being 
administered to the students. Statistically significant GSR’s were obtained 
during the arithmetic instructions and the arithmetic subtest of the WB-I, 
but only for those subjects who had been independently identified as 
anxious about mathematics.

Milliken and Spilka (1962) measured breathing depth and rate,
blood pressure, heart rate, and GSR during the first and last 30 seconds
of the time that their subjects were taking each subtest of the American
Council on Education Psychological Examination (ACE). The results
showed that an examinee who was low in mathematical and high in verbal
score on the Scholastic Aptitude Test gave greater physiological responses
during administration of the ACE quantitative tests. [...]
in general gave greater physiological responses during the verbal
tests than during the quantitative tests; the reverse was true for [...].


Grade Distribution and Stability of Attitudes 


The Elementary-School Years 


It is generally recognized that attitudes toward mathematics in adults
can be traced to childhood (Morrisett & Vinsonhaler, 1965). There
is evidence that very definite attitudes toward arithmetic may be formed
as early as the third grade (Fedon, 1958; Stright, 1960), but these
attitudes tend to be more positive than negative in elementary school
(Stright, 1960). For example, a survey by Herman (1963) of the school
subjects preferred by a group of fourth, fifth, and sixth graders found
that arithmetic
was typically in the middle when subjects were ranked from least to most
preferred. For boys, the order of the five subjects from least-liked to most- 
liked was English, social studies, arithmetic, science, and spelling. For 
girls, the order from least-liked to most-liked subject was social studies, 
science, arithmetic, English and spelling. 

Indirect information concerning the grade distribution of attitudes 
toward mathematics is found in reports given by groups of college students 
majoring in education. In general, the students reported that they developed 
their attitudes toward arithmetic throughout the school grades—from second 
through twelfth grade, but that the intermediate grades—fourth through 
sixth—were more influential (see Dutton, 1962; Smith, 1964; White, 1964). 
This seems reasonable, because these are typically the three grades in 
which arithmetic is stressed most. In McDermott’s (1956) case studies of 
34 college students who were afraid of mathematics, the majority reported 
having first met with frustration in the elementary grades; the remainder 
stated that they met difficulty when they attempted the use of algebraic 
symbols and other higher mathematical concepts in secondary school. 

Interestingly enough, there is some evidence for a decline from the 
third through the sixth grade in the percentage of pupils expressing negative 
attitudes toward arithmetic (Stright, 1960). However, the change may be 
due to increasing social sophistication on the part of the pupils or to an 
increased willingness to simulate positive attitudes because they have been
told that mathematics is good for them and that positive attitudes please
the teacher.


The Junior-High-School Years 


The results of a number of studies point to the persistence of negative
attitudes toward mathematics as students ascend the academic ladder.
In the traditional curriculum the junior high school has been the period
during which algebra and other abstract mathematics were introduced.
Therefore, it is noteworthy that the greatest percentage (40%) of the
prospective teachers surveyed by Reys and Delon (1968) reported the
junior-high-school years as the period when their attitudes toward
arithmetic reached a peak of development. Even under more contemporary
mathematics curricula, junior high school seems to be a critical point in
the development of attitudes toward mathematics (see Dutton, 1968).

Dutton and Blum (1968) surveyed by questionnaire the reasons for
liking and disliking arithmetic in 346 sixth-, seventh-, and eighth-grade
pupils who had been taught “new math” for at least one year. The most
frequent reasons which the students gave for disliking the subject were:
working problems outside of school, word problems that were frustrating,
possibilities of making mistakes in arithmetic, and too many rules to learn.
A large percentage of the pupils agreed with the statements that arithmetic
should be avoided whenever possible, that one cannot use new mathematics 
in everyday life, and that arithmetic is a waste of time. Favorable attitudes 
endorsed by pupils were that working with numbers is fun and presents a 
challenge, and that arithmetic makes one think, is logical and is practical. 


Dutton (1968) suggested that there was a decline during the past ten 
years in the number of junior-high-school pupils expressing negative atti- 
tudes toward arithmetic, but that a sizable percentage of pupils still lack 
self-confidence in the subject. However, Dutton compared the attitudes of 
a group of junior-high-school pupils of a decade ago with those of a current 
group, and this procedure probably did not result in equivalent groups. 


Longitudinal Studies of Attitudes Toward Mathematics 


possible inappropriateness of the same attitude measure at different grade 
levels. Fedon’s (1958) use of a color scheme for indicating intensity of 
attitude in the lower grades represents an interesting attempt to extend 
downward a scale which was constructed for students on a much higher 
level. 


Actually, there have been very few longitudinal studies of attitudes.
My unpublished analysis of the mean scores on the SMSG mathematics
attitude scales obtained by the same group of approximately 1000 children
in grades four, six, and eight revealed significant changes across grade
levels in mean scores on some scales, although these changes were not
very dramatic (data from Wilson et al., 1968). Anttonen (1968) administered
94 attitude items arranged into 15 Guttman-type scales to 607 fifth-
and sixth-grade Minnesota school children in 1960. The scales were
readministered to a portion of the same group six years later when the
students were in the eleventh and twelfth grades, respectively. Naturally,
there was some attrition in the sample over the period of six years, and
this should be taken into account in interpreting the results. The correlation
between attitudes toward mathematics in elementary and secondary school
was relatively low (average r of .30) for the entire group, for the two
sexes considered separately, and for four different patterns of mathematics
coursework. However, scores on the Guttman attitude scale administered
in senior high had a high correlation with a semantic differential measure
of attitudes administered during the same period.

In sum, it seems possible to measure attitudes toward arithmetic and
mathematics as early as the third grade, but, as in any characteristic
affected by development, such attitudes are probably not very stable in
the early grades. In addition, the preciseness with which pupils can express
their attitudes varies with level of maturity. Finally, scores on the
instruments employed in the majority of studies represent composites of
attitudes toward different aspects of mathematics rather than measures of
attitude toward a specific part of the subject. These “generalized attitude”
instruments may overlook important facets of the variable of interest. For
example, attitude toward materials to be learned by rote, such as the 
multiplication table, is probably not the same dimension as attitude toward 
word problems or algebraic symbols. In any case, of greater importance 
than the exact frequency of attitudes at different grade levels are the causes 
and effects of these attitudes. 


The Relationship of Attitude to Achievement in Mathematics 


Obviously, the assessment of attitudes toward mathematics would be 
of less concern if attitudes were not thought to affect performance in some 
way. The relationship between attitudes and performance is certainly the 
consequence of a reciprocal influence, in that attitudes affect achievement 
and achievement in turn affects attitudes (see Neale, 1969). This dynamic
interaction between attitudes and behavior has received a great deal of
attention in the recent literature (see Festinger et al., [...]), but few
studies deal specifically with mathematics learning.

Bernstein (1964) maintained that if certain feelings are experienced
for a time they will lead to a particular self-image on the part of the
pupil, a self-image which will influence his expectation of future
performance and affect his actual performance. Data collected by Kempler
(1962) have a bearing on Bernstein’s assertion; Kempler’s data suggest that
self-confidence in one’s mathematical ability, as measured by a 15-item
scale, is associated with rigidity in mathematical tasks like the
Luchins water-jar test. Behaviors indicative of the rigidity which students
show in the face of frustrating mathematical tasks, causing them to be
anxious [...] toward the subject, include resorting to rote memory and
[...], and relying on other people and dishonest means in [...]
([...], 1956). In contrast to the rigidity and “giving up” of those who
dislike mathematics is the constructive perseverance of those who like
mathematics. Thus, Shapiro (1962) found that persistence in seeking
solutions to arithmetic problems was higher in school children who liked
mathematics than in those who disliked it, and girls as a group were more
persevering than boys.

A similar analysis of the relationships among attitude, expectation,
and performance was made by Alpert et al. (1963), who view level of
expectation and performance as a kind of self-perpetuating cycle affecting
a child’s self-concept; attitudes and anxiety are closely related to this
concept. The idea of a self-perpetuating cycle linking expectation and 
performance is consistent with the observation that the variability of 
arithmetic performance increases as pupils proceed through elementary 
school. That is, the difference in performance between the poorest and 
the best pupils becomes progressively greater as they ascend the academic 
ladder (Clark, 1961). 

The relationship of attitudes, which are integrally related to expec- 
tations, to performance appears to be especially important in mathematics 
learning. One study (Brown & Abell, 1965) clearly demonstrated that 
the correlation between pupil attitude toward a subject and achievement 
in that subject was higher for arithmetic than for spelling, reading, or 
language. The following sections contain a review, by grade level, of the 
results of research over the past decade on the relationship between attitudes 
and achievement in mathematics. Unfortunately, these studies are not 
always consistent in their findings, although they generally report low 
to moderate correlations between the variables. In addition, one investi- 
gator (Neale, 1969) has argued that patience, compliance, and obedience 
are more important than attitudes as determinants of achievement in 
mathematics. 


Elementary-School Level 


In a study of the attitudes toward problem solving of a group of
Brazilian elementary school children, Lindgren et al. (1964) obtained a
small but significant positive correlation (r = .24, N = 108) between
problem-solving attitudes and scores on an arithmetic achievement test,
and a positive but not significant correlation between attitudes and marks
in arithmetic. Shapiro (1962) found that her interview measure of attitude
in sixth graders was significantly related to grade placement on the Wide
Range Achievement Test, to all parts of the arithmetic section of the
California Achievement Test, and to school marks in arithmetic. [...]ippon
(1968) obtained consistently low correlations of mathematics attitude
scores with grade averages and with the arithmetic total score on the Iowa
Tests of Basic Skills in fifth- and sixth-grade pupils. Indirect evidence
for a relationship between attitude and achievement was provided
by Dutton (1962), who found a low positive relationship between the
attitudes toward arithmetic of a group of college students and their
reported arithmetic grades in elementary school.

Obviously, the correlations between attitude and achievement in elementary
school, although statistically significant in certain instances, are
typically not very large. One investigation of sixth graders (Cleveland,
1962) revealed that attitude scale scores did not generally discriminate
between high- and low-achievers in arithmetic. A difficulty with self-report
inventories at the elementary-school level is the readability and
interpretability of the attitude instrument; another problem concerns the degree
of self-insight and conscientiousness with which the pupils fill out the 
inventory. Fortunately, these problems do not appear to be quite as 
serious at higher grade levels. 


Junior-High-School Level 


Summarizing the results of a survey of 270 seventh-grade boys and girls, 
Alpert et al. (1963) reported significant correlations between performance 
in mathematics and measures of attitudes and anxiety toward mathematics. 
Similar results were given by Degnan (1967), Stephens (1960), and 
Werdelin (1966). In a comparison of accelerated and remedial mathe- 
matics classes, Stephens (1960) administered Dutton’s attitude scale to 
six seventh-grade and six eighth-grade classes. The mean attitude score 
of the accelerated group was significantly higher than that of the remedial 
group. Therefore, Stephens concluded that attitude score might be used 
with achievement scores for placement in special classes. 

Degnan (1967) compared the attitudes and general anxiety levels of 
22 eighth-grade students designated as low achievers in mathematics with 
those of 22 eighth-grade students designated as high achievers in mathe- 
matics. Dutton’s (1962) scale was the measure of attitudes; the Children’s 
Manifest Anxiety Scale (Castaneda et al., 1956) was the measure of general 
anxiety. Although it was found that the achievers were generally more 
anxious than the underachievers, the achievers had more positive attitudes 
toward mathematics. Also, when the students were asked to list their major 
subjects in order of preference, the achievers gave mathematics a
significantly higher ranking than the underachievers. Among other things, the
results of this study show that attitude toward arithmetic and general
anxiety are not the same variable, a conclusion related to the earlier
finding of Dreger and Aiken (1957) that “general anxiety” and “math anxiety”
are not the same. The study also demonstrates that anxiety may act as a
facilitating factor in achievement, as noted by Alpert et al. (1963) in their
distinction between “facilitating anxiety” and “debilitating anxiety.”


High-School Level 


In his longitudinal study of attitudes, Anttonen (1968) reported moderate
correlations of mathematics attitude scores with mathematics grade-point
averages and standardized test scores in eleventh and twelfth
graders. Achievement was also greater for students whose attitudes had
remained favorable or had become favorable since elementary school.


College Level 


Perhaps because of the greater accessibility of subjects,
many investigators prefer to work with college students. College
students, on the average, presumably have more positive attitudes toward 
academic work than their non-college counterparts. Therefore, it would 
seem that the frequency of negative attitudes toward mathematics, and 
consequently the variability of the distribution of attitude scores, should 
be lower for college students than for the general population. If this is 
true, then one might expect a somewhat smaller correlation between atti- 
tudes and achievement in college than in high school. However, college 
students may fill out attitude inventories more conscientiously and with 
greater self-insight than the population as a whole, and this would promote 
higher attitude-achievement correlations.

Some investigators have found rather low correlations between mathe- 
matics attitudes and mathematics achievement in college students. Har- 
rington (1960) reported a statistically insignificant relationship between 
attitude and performance in college mathematics courses, although he did 
find that selection of a mathematics course vs. no mathematics course was 
significantly related to attitude. Somewhat more substantial relationships 
between attitudes and achievement were obtained by Dreger and Aiken 
(1957) and Aiken and Dreger (1961). In the former study there was a 
correlation of —.44 between the final grades of 704 students in a freshman 
mathematics course and their scores on a three-item inventory of anxiety 
in the presence of mathematics. In the second study (1961), scores on the
Math Attitude Scale contributed significantly to the prediction of the final 
mathematics grades of 67 college women when the scores were combined in 
a regression equation with high school mathematics averages and scores 
on the Verbal Reasoning and Numerical Ability tests of the Differential 
Aptitude Tests. However, the Math Attitude Scale was not a significant 
predictor for the 60 college men. Finally, there were statistically
significant part correlation coefficients for both males (r = .33) and females
(r = .34) between Math Attitude Scale scores and scores on a retest of
the Cooperative Mathematics Pretest for College Students, after initial scores 
on the latter variable had been partialed out. 
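
As a rough illustration of what “partialing out” initial scores means, the sketch below computes a part (semipartial) correlation from three zero-order correlations, using the standard textbook formula. The variable names and the numerical values are hypothetical and are not the coefficients from the Aiken and Dreger study: x stands for attitude, y for retest score, and z for initial score.

```python
from math import sqrt


def part_correlation(r_xy, r_xz, r_yz):
    """Correlation of x with the part of y that is independent of z."""
    return (r_xy - r_xz * r_yz) / sqrt(1 - r_yz ** 2)


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z removed from both (for comparison)."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical zero-order correlations, chosen only for illustration.
part = part_correlation(r_xy=0.5, r_xz=0.4, r_yz=0.6)
partial = partial_correlation(r_xy=0.5, r_xz=0.4, r_yz=0.6)
```

The part correlation removes the initial score from the retest only, so it asks whether attitude is related to the change in performance; the partial correlation removes it from attitude as well and is never smaller in absolute value.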


Attitude as a Moderator Variable 


The Aiken and Dreger (1961) study is an illustration of the multiple
correlation approach to prediction, in which measures of attitude and
ability were combined in a regression equation to predict achievement. A
second approach to prediction, which is actually a special case of the first,
is to view attitude as a moderator variable and to determine the correlation
between ability and achievement separately at each of several levels of
attitude. Thus, it may be discovered that the correlation between ability
and achievement varies with level of attitude.


Cristantiello’s (1962) study is an example of this moderator variable
approach. College sophomore men (N = 264) were classified by area of
major (business administration, social science, natural science), and within
each of these areas further categorized into three levels (high, middle, low)
according to their scores on a scale of attitude toward mathematics. Then
the correlation between scores on a measure of quantitative ability (ACE-Q
scores) and mathematics grades was found separately for each of the nine
groups formed by crossing major area with attitude level. The correlations
between ACE-Q scores and mathematics grades were significantly more
positive for students
with middle attitude scores, and significantly smaller for those with low
attitude scores. These results could not be explained by differences among 
the groups in variances of either grades or ACE-Q scores. 
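
The moderator variable approach amounts to computing the ability-achievement correlation separately within each attitude level. The sketch below shows the mechanics only; the records are invented for illustration and are not Cristantiello’s data, and the grouping uses attitude level alone rather than the nine major-by-attitude groups.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlations_by_level(records):
    """Correlate ability with grade separately within each attitude level."""
    groups = defaultdict(lambda: ([], []))
    for level, ability, grade in records:
        groups[level][0].append(ability)
        groups[level][1].append(grade)
    return {level: pearson(xs, ys) for level, (xs, ys) in groups.items()}


# Hypothetical (attitude level, ability score, mathematics grade) records.
records = [
    ("middle", 1, 2), ("middle", 2, 3), ("middle", 3, 5), ("middle", 4, 6),
    ("low", 1, 3), ("low", 2, 1), ("low", 3, 4), ("low", 4, 2),
]
r_by_level = correlations_by_level(records)
```

If the coefficients differ markedly across levels, attitude is acting as a moderator of the ability-achievement relationship; whether such differences are statistically reliable would need a separate test.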

Although Cristantiello’s results should be replicated, they may be in- 
terpreted as indicating that mathematical ability may be a less important 
determiner of the achievement of students who have more extreme attitudes 
toward mathematics than of those who have more moderate attitudes. 
Related to these findings is Jackson’s (1968) conclusion that attitude scores 
in the middle range of scores have little relation to achievement. He 
maintained that it is only at the extremes (highly positive or highly
negative) that attitude affects achievement in any significant way. If
Jackson is correct, then it is reasonable to expect that in the middle range
of attitude scores, as was found by Cristantiello, ability scores rather than
attitude scores will be more accurate predictors or determiners of
achievement.


An International Study of Attitudes and Achievement 


In an international study designed to assess the mathematics achieve- 
ment of 13- and 17-year-old (terminal secondary) students in a dozen 
countries (Husén, 1967), extensive data concerning attitudes, interests, and 
certain other variables were also collected. Of the five attitude scales
which were administered, three are of particular interest: these were the
measures of attitudes toward mathematics as a process, attitudes about the
difficulties of learning mathematics, and attitudes about the place of
mathematics in society. One of the findings concerning scores on the first
scale (a measure of the extent to which mathematics is viewed as fixed, as
opposed to developing or changing) was that in all countries studied, the
upper-level students considered mathematics as less changing than did the
lower-level students. There was also a tendency for students [...] to see
mathematics as more open and [...].

Scores on the second scale (a measure of the perceived difficulty of
learning mathematics) indicated that upper-level students tended to
perceive mathematics as more difficult and demanding. Interestingly enough,
scores on the third scale (a measure of the perceived role of mathematics
in contemporary society) indicated that mathematics was viewed as less
socially vital or valuable by students with the longest exposure to it and 
by students in countries where English is spoken. 

Some of the correlational results of this international investigation 
were: (1) significant negative rank-order correlations between mean mathe- 
matics achievement and mean scores across countries on the attitude scales, 
(2) rather small positive correlations between achievement and attitude 
within countries, and (3) moderate to high positive correlations between 
achievement and interest measures within countries. The reader should be 
alerted to the difference in the findings “between countries,” where mean 
scores are the unit of analysis, and the findings “within countries,” where 
raw scores are the unit of analysis. Results such as these are not uncom- 
mon and are a matter of group differences versus individual differences. 
In summarizing the results referred to above, Husén (1967, p. 45) con- 
cluded: “We may say, in general, that in those countries where achievement 
is high, pupils have a greater tendency to perceive mathematics as a fixed 
and closed system, as difficult to learn, for an intellectual elite, and as 
important to the future of human society.” 
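
The contrast between “between countries” and “within countries” findings can be reproduced with a toy example. The three countries and their scores below are entirely hypothetical; they are constructed so that attitude and achievement rise together inside every country while the country means are ordered in opposite directions. The rank function assumes no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest); assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def rank_order_correlation(x, y):
    """Spearman rank-order correlation as Pearson r of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical pupil scores in three hypothetical countries.
countries = {
    "A": {"attitude": [1, 2, 3], "achievement": [7, 8, 9]},
    "B": {"attitude": [4, 5, 6], "achievement": [4, 5, 6]},
    "C": {"attitude": [7, 8, 9], "achievement": [1, 2, 3]},
}

# Within countries: raw scores are the unit of analysis.
within = {
    name: pearson(d["attitude"], d["achievement"])
    for name, d in countries.items()
}

# Between countries: country means are the unit of analysis.
mean_attitude = [sum(d["attitude"]) / 3 for d in countries.values()]
mean_achievement = [sum(d["achievement"]) / 3 for d in countries.values()]
between = rank_order_correlation(mean_attitude, mean_achievement)
```

In this constructed case every within-country coefficient is positive while the between-country coefficient is negative, which is the same kind of pattern the text describes; the two analyses answer different questions and neither can be inferred from the other.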


The Relationships of Attitudes to Personality and Social Factors 


Anxiety and Attitude 


As was noted above, attitudes are affective variables, so some relation- 
ships between a measure of attitude and a measure of anxiety toward a par- 
ticular school subject should not be unexpected. In addition, anxiety and 
attitude may be either general or specific, pertaining to only one situation or 
event or to many. In a number of studies during the past decade, researchers 
(e.g., McGowan, 1960; Reese, 1961) have related scores on the Children’s 
Manifest Anxiety Scale (Castaneda et al., 1956)—presumably a measure of 
debilitating anxiety—to performance in mathematics. Typically, these re- 
searchers found small but statistically significant negative correlations 
between manifest anxiety and achievement; these correlations were usually 
somewhat smaller in absolute value than the correlations between attitudes 
and achievement. Thus, Reese (1961) obtained a correlation of -.25
between scores on the Children’s Manifest Anxiety Scale and arithmetic 
achievement in fourth- and sixth-grade girls, when IQ was partialed out. 


The relationship between attitude toward academic work in general
and attitude toward mathematics in particular has also been investigated,
although there is an apparent inconsistency in the findings of two
investigations. In a study conducted in Sweden, Werdelin (1966) administered
a questionnaire concerning attitudes toward school work in general and
mathematics in particular to ninth graders. A close relationship between
attitudes toward school work in general and attitudes toward mathematics
was found. This finding contrasts with that of Aiken and Dreger (1961)
that a test of independence between scores of college students on the Math
Attitude Scale and their scores on four items designed to measure attitudes
toward school work in general was not significant. There are several
possible explanations for the difference between the findings of the two
investigations; age level and nationality were not the same, and the
instruments were not equivalent. Nevertheless, it is possible to construct
an inventory to measure anxiety or attitude which is fairly specific to
mathematics (Aiken & Dreger, 1961).


Intellective Factors


Although it has been observed that general ability to learn is associated
with liking for arithmetic (see Brown & Abell, 1965), measures of anxiety
and attitudes toward school subjects typically have rather low correlations
with measures of intellectual ability (Aiken, 1963; Dreger & Aiken, 1957;
Lindgren et al., 1964). Dreger and Aiken (1957) found, for example, that
anxiety in the presence of mathematics had a statistically nonsignificant
correlation of -.25 with ACE Quantitative scores and a correlation of
-.08 with ACE Linguistic scores. Lindgren et al. (1964) found essentially
zero correlations between Carey’s (1958) measure of problem-solving
attitude and intelligence test scores in a group of fourth-grade pupils
in Brazil. For a group of 160 college women, Aiken (1963) obtained an
insignificant correlation between Math Attitude Scale scores and Scholastic
Aptitude Test (SAT) Verbal scores, but attitude scores were significantly
correlated with SAT Quantitative scores (r = .37).

Two comments about these data may be made. One might expect
attitude toward a specific subject to be significantly related to a measure
of ability in that subject because measures of specific ability and specific
[...]


Social Factors 


One possible social determiner of attitude toward mathematics is the
attitudes of one's peers. Shapiro's (1962) findings indicated that peer
attitudes in elementary school may indeed be influential, especially in the 
case of girls. The social influences of parents and teachers will be treated 
in detail in later sections of this review. Otherwise, the effects of social 
factors on attitudes toward mathematics appear to be relatively unimportant. 
The fact that negative attitudes toward mathematics are not restricted to 
a small school system is documented by McDermott (1956), who found 
that the backgrounds of students who were afraid of mathematics ranged
from one-room rural schools to large city school systems.

Lindgren et al. (1964) reported an essentially zero correlation between
socioeconomic status and Carey’s (1958) measure of problem-solving atti- 
tudes, and Hungerman (1967) obtained correlations between 


mathematics test scores are usually not as highly related as verbal test 
scores to socioeconomic status (see Karas, 1964). Karas (1964) maintained 
that the home environment has a greater effect on performance in more 
verbal subjects than in subjects such as mathematics that are more highly 
loaded with less familiar symbolic material. Considering the positive re- 
lationship between attitude and achievement, one may generalize from 
Karas’s findings that socioeconomic status and perhaps other home factors 
have less effect on attitude toward mathematics than on attitude toward 
more verbal subjects. I am not aware of any research specifically designed
to test this hypothesis, but several researchers have conducted studies on
the relationships of parental attitudes and encouragement to student atti- 
tudes toward mathematics. 




Parental Influences 


According to Poffenberger and Norton (1959), parents affect the
child’s attitude and performance in three ways: (1) by parental expectations
of child’s achievement, (2) by parental encouragement, and (3) by parents’
own attitudes. As support for their hypothesis that the conditioning of
children’s attitudes occurs in the family, the authors cite the results of a 
study of 390 University of California freshmen. The students filled out a 
questionnaire concerning their own attitudes and the attitudes and expecta- 
tions of their parents. The findings were that the students’ attitudes toward 
mathematics were positively related to how they rated their fathers’ atti- 
tudes toward mathematics. The attitudes of the students were also related 
to their reports of the level of achievement in mathematics which their 
fathers and mothers expected of them. Poffenberger and Norton (1959) 
suggested that attitudes reported for the mothers were not significantly 




REVIEW OF EDUCATIONAL RESEARCH Vol. 40, No. 4 


related to students’ own attitudes because only a small number of students 
indicated that their mothers liked mathematics. 

In a further analysis of self-report data, Poffenberger (1959) found 
that college students who reported a distant relationship with their fathers 
showed a significant tendency to perceive their fathers as disliking mathe- 
matics. In contrast, students who reported a close relationship with their 
fathers did not differ from the total sample of students in their ratings of 
their fathers’ attitudes toward mathematics. However, Poffenberger did 
not interpret these data as offering support for the hypothesis that attitude 
toward mathematics is caused by the warmth of a child’s relationship with 
his father—the masculine identification model. Rather, the results were 
seen as being due to a generalized perception on the part of students, viz., 
children who feel that their parents do not like them (since they are not 
close to them) perceive the parents as negatively oriented to other aspects 
of life as well, in this case mathematics. The relationship between masculine
identification and attitude toward mathematics is treated in detail
below, but several other studies concerned with parental attitudes and 
expectations are reviewed first. 

Aiken and Dreger (1961) found no significant correlations between 
Math Attitude Scale scores and student reports on the degree to which 
parents emphasized and encouraged school work when the students were 
children. It is noteworthy that although none of these correlations for the 
male or female students was significantly greater than zero, the correlations 
for females were uniformly more positive than for males. 

The three studies reviewed above were concerned with student reports 
of the expectations and attitudes of their parents. More direct information 
on the relationships of student attitudes to parental expectations and 
attitudes was obtained by Alpert et al. (1963) and Hill (1967). Alpert et 
al. (1963) developed a parental interview and questionnaire to determine
the extent to which parental attitudes and values were consistent with
those of the School Mathematics Study Group and how much they affected 
the attitudes of their seventh-grade children toward mathematics. These 
were the results: (1) student attitudes, for both boys and girls, were
positively correlated with the amount of mathematics education desired by
parents for their children; (2) boys’ attitudes were positively correlated
with the importance which their parents placed on grades and with
parental demands for higher grades, whereas girls’ attitudes toward mathe- 
matics were negatively related to the importance that their parents placed 
on grades; and (3) student attitudes for both boys and girls were
positively correlated with parents’ views of competition as good and as 
necessary in the modern world. An interesting sex difference also occurred
with respect to these parent variables. Parents of boys who had positive
mathematics attitudes tended to view the goal of a junior-high mathematics
program as “to aid the intellectual development of the child”; parents of
girls who had positive mathematics attitudes tended to see the goal of a 
junior-high mathematics program as “ability to deal competitively with 
practical everyday problems,” whereas the parents of girls with negative 
mathematics attitudes tended to view the goal as “to aid the intellectual 
development of the child.” 

Hill (1967) interviewed the fathers and mothers of 35 upper-middle- 
class boys and administered a questionnaire concerned with attitudes toward 
mathematics to their sons. He found that a greater similarity between the 
attitudes of mothers and sons was related to maternal warmth, use of 
psychological control techniques, and low paternal participation in child 
rearing. Parental attitudes and expectations for their sons were not signifi- 
cantly related, but sons did show greater accordance with the expectations 
of their fathers than with those of their mothers. The variables of father 
warmth and degree of participation in child rearing were positively related 
to degree of son’s accordance with father’s expectations. Fathers who had 
greater expectations of masculine behavior on the part of their sons and 
who viewed mathematics as a masculine subject had a higher level of 
aspiration in mathematics for their sons. Quite obviously, Hill’s (1967) 
data cannot be handled adequately by the hypothesis that positive attitudes 
toward mathematics are due to masculine identification. But researchers 
need to look a bit further into the data on sex differences and masculinity 
vs. femininity of interest before drawing conclusions about the adequacy 
of any sex-identification hypothesis. 


Sex Differences 


No one would deny that sex can be an important moderator variable 
in the prediction of achievement from measures of attitudes and anxiety. 
The results of several of the investigations discussed thus far (e.g., Aiken
& Dreger, 1961; Reese, 1961) suggested that measures of attitudes and 
anxiety may be better predictors of the achievement of females than of 
males. Mathematics has traditionally been viewed as more of a man’s 
interest or occupation, and consequently one might expect that males would 
score higher than females on tests of ability and achievement in mathe- 
matics and on scales of attitude toward mathematics. Norms on the 
mathematics sections of tests such as the Differential Aptitude Tests and
the Scholastic Aptitude Tests do indicate higher mean scores for males than
for females at the high school level; this sex difference has been interpreted 
as being produced by greater cultural reinforcement of interest and pursuit 
of mathematics in males at the higher grade levels. Boys have traditionally 
been viewed as better than girls in problem solving (see Sweeney, 1954), 
but in one recent study of eleventh graders (Meyer & Bendig, 1961) the 
investigators found a superiority on the part of girls in the number and
reasoning factors of the Primary Mental Abilities test. In two recent studies
of sex differences in arithmetic achievement at the elementary-school level, 
the investigators found either no difference between the performance of 
boys and girls or a superiority on the part of girls, depending on the test 
and the grade level (Shapiro, 1962; Wozencraft, 1963). 

More specific to sex differences in attitudes toward mathematics are 
Stright’s (1960) finding that elementary-school girls liked arithmetic better 
than the boys and Dutton’s (1968) finding that girls and boys who had 
studied “new math” were about equal in their liking for arithmetic. Never- 
theless, in studies at the college level (Aiken & Dreger, 1961; Dreger & 
Aiken, 1957) I have consistently found a significantly more positive mean 
attitude toward mathematics in males.* Assuming equivalent samples, the 
difference between the results at the lower grade levels and at the college 
level may be due, as noted above, to differential cultural reinforcement for 
males in mathematical endeavors, beginning at the secondary-school level. 
In addition, any explanation of the discrepancy in results must take into 
account interactions between the sex variable and accuracy of attitude 
measures in the earlier school grades, desire to please the teacher, and 
general rate of academic maturation. 


Masculinity-Femininity of Interests 


A not uncommon finding concerning the interest patterns of those 
who like and those who dislike mathematics is that reported by McDermott 
(1956) in a case-study comparison of 34 college students who feared 
mathematics with 7 students who were proficient in the subject. McDermott 
found that those who had developed a fear of mathematics preferred 
English, social studies, and the arts, but disliked the definiteness of 
mathematics. The students who were proficient in mathematics were critical 
of the vagueness of the humanities and were not interested in majoring in 
that area. A hypothesis related to McDermott’s findings and referred to 
in Feierabend’s earlier review (Feierabend, 1960) is that interest and 
ability in mathematics are a consequence of masculine identification. Since 
1959 there have been several investigations concerned with this hypothesis. 

In one test of the above hypothesis, Lambert (1960) administered 
the American Council on Education Psychological Examination (ACE), an 
arithmetic skills test, and the MMPI to 1372 U.C.L.A. undergraduates.
Group I consisted of 80 students in advanced mathematics or physics
courses; Group II was composed of 1292 senior education students. Contrary
to the masculine-identification hypothesis of Plank and Plank (1954),


*In an assessment of the attitudes toward mathematics of 264 Fairleigh Dickinson
University students, Roberts (1969) reported no significant sex differences in attitudes,
but engineering students held more positive attitudes than students in terminal mathe-

Lambert found no correlation between mathematical proficiency and MMPI
Masculinity-Femininity (Mf) scores in either sex for either of the groups. 
The mean Mf score of the 10 female mathematics majors was significantly 
more feminine than that of the 744 female education majors. Finally, there 
was no significant difference between the mean scores of the male mathe- 
matics majors and the male education majors on the Mf scale of the MMPI. 
It should be cautioned that the comparatively small number of female 
mathematics majors in this study casts some doubt on the generalizability of 
the results. In addition, the factor of general intelligence was not taken into 
account in the analysis and may have affected the results. For example, the 
selected group of mathematics majors may have been more intelligent and 
therefore perhaps more interested in cultural (i.e., “feminine”) pursuits
than typical persons with positive attitudes toward mathematics. Also, it 
is uncertain how representative the group of education majors was of the 
general college population; a comparison group randomly selected from 
all major fields should have been selected. Finally, the MMPI Mf scale is 
not necessarily the best measure of masculinity-femininity of interests. 
Lambert’s (1960) study was fairly easy to conduct, and it should be repli- 
cated and extended, in light of the criticisms made above, at other schools 
and colleges and with other measures of masculinity-femininity. 


In another test of the masculine-identification hypothesis at the college 
level, Carlsmith (1964) obtained students’ reports of the length of time
that their fathers had been absent from home when the students were 
children. These “time reports” were compared to the students’ scores on 
the Verbal and Mathematical sections of the Scholastic Aptitude Test 
(SAT) and to the difference between SAT-Verbal and SAT-Mathematical 
scores. For both boys and girls, the longer the father was absent from the 
child during early childhood the lower was the latter’s mathematics score 
relative to his verbal score. An additional finding was that if the father 
was absent for a short period of time during a boy’s adolescence, the 
boy’s mathematics score was higher than in cases where the father was 
not absent at this time. As an explanation of these results, Carlsmith re- 
jected the idea that separation from the father produces anxiety and anxiety 
affects mathematics scores more than verbal scores. He maintained that 
the “masculine conceptual approach,” which is needed in order to achieve 
in mathematics, is acquired through close and harmonious association with 
the father. Certainly Carlsmith’s investigation, like that of Lambert (1960), 
should be replicated, because there is an apparent disagreement between the 
results of the two studies. However, one must be cautious about the method 
used to measure masculine identification. As a way of linking the results 
of the Carlsmith and Lambert studies, it may be of interest to determine 
the relationship between father absence during early childhood and scores
on a masculinity-femininity interest measure such as the MMPI Mf scale.

Elton and Rose (1967) tested the hypothesis that girls avoid mathe- 
matics because they view it as a masculine activity. It was predicted that 
girls who had high scores on the English section but only average scores
on the Mathematics section of the American College Test (ACT) would 
show more feminine interests on the Omnibus Personality Inventory (OPI). 
In contrast, girls with average scores on the ACT English section and 
high scores on the ACT Mathematics section should presumably manifest 
more masculine interests on the OPI. The scores on the ACT and OPI of 
females in the 1962-1965 classes at the University of Kentucky were 
analyzed. Students’ scores were classified as low, average, and high on the 
ACT Mathematics and English tests, and the data on students showing 
seven of the nine possible combinations (e.g., high in English and low in 
Mathematics, or low in English and average in Mathematics) were related 
by multiple discriminant analysis to their factor scores on the 16 OPI 
scales. The results indicated that girls in the high English-high Mathe- 
matics group had more theoretical and fewer esthetic (i.e., more mascu- 
line) interests. The difference between masculinity-femininity of interests 
was also in the predicted direction for the low English-average Mathematics 
and average English-low Mathematics groups, the former group showing 
more masculine interests on the OPI than the latter group. Thus, as in 
the Carlsmith (1964) study, masculine identification, or masculine role, 
was a predictor of large differences between verbal and mathematical scores 
on a college entrance examination. The findings of Elton and Rose (1967), 
however, are perhaps more easily accepted and interpreted than those of 
Carlsmith, because they do not require that researchers reach back to an 
event in a person’s early childhood as an explanation of the difference 
between his verbal and mathematical scores on a college admissions test. 
Many of the items on the OPI concern interests in reading, science, and 
other verbal-related and mathematics-related pursuits. It is not surprising 
that girls with more verbal-related (viz., “cultural”) interests, as measured 
by the OPI, should also have higher verbal-ability than mathematical- 
ability scores, whereas girls with more mathematics-related (viz., scientific, 
theoretical) interests should have higher mathematical-ability than verbal-ability
scores. It is not necessary to argue whether the scientific-theoretical
(masculine) interest or the mathematical ability came first, or whether 
the cultural (feminine) interests or the verbal ability came first. The two

*Another interesting hypothesis of Carlsmith (1964) is that aptitude for mathematics

factors—interest and ability—form a mutually reinforcing system. There
are obviously many other factors that enter into the system, but in general
people tend to like those things which they do well, and, perhaps to a 
more limited extent, they tend to do well in those things which they like. 


Sex role is only one of the personality variables which are related 
to attitude toward and performance in mathematics. Obviously, there are 
many sources of within-sex differences in attitudes. These differences in 
attitudes are certainly related to differences in ability, but they may also 
be related to other personality variables. In addition, it may be of interest 
to review some of the investigations concerned with the relationships 
between achievement in mathematics and personality variables other than 
sex role. The results may shed some light on the dynamics of attitudes 
toward mathematics. 


Other Personality Variables 


Correlations With Attitudes. In an initial study employing the Math 
Attitude Scale, Aiken and Dreger (1961) found little relationship between 
mathematics attitudes and scores on the seven scales of the Minnesota
Counseling Inventory (MCI). The MCI Leadership scale had the highest 


of 67 college women. More evidence on the relationships of mathematics 
attitudes to a broad constellation of personality variables was obtained by 


on three personality inventories—the California Psychological Inventory 
(CPI), the Sixteen Personality Factor Questionnaire (16 PFQ), and the 
Allport-Vernon-Lindzey Study of Values (SV). When scores on the SAT- 


statistically significant—the correlations between mathematics attitudes and 
CPI Dominance, CPI Self-Control, CPI Achievement via Conformance, CPI 
Intellectual Efficiency, 16 PFQ Integration, and SV Theoretical scale. 
Aiken (1963) interpreted these results as demonstrating that high scorers 
on the Revised Math Attitude Scale, with mathematical ability statistically 
controlled, tend to be more socially and intellectually mature, more self- 
controlled, and to have more theoretical interests than low scorers on the 
scale. 

Correlations With Achievement. Feierabend (1960, pp. 21-23) devoted 
three pages of her review to research relating personality variables to 
achievement in mathematics. Since achievement and attitude are related, 
it may be of interest to summarize briefly the results of two studies on
the topic which have been completed since 1959. Cleveland (1962)
divided a group of sixth-grade pupils into three IQ ranges on the basis
of their scores on the California Test of Mental Maturity: 75-89, 90-110, 
and 111-125. Although scores on the California Test of Personality (CTP) 
did not significantly differentiate between low achievers and high achievers 
in mathematics among children in the 75-89 and 111-125 IQ ranges, 
there were several significant differences in personality test scores between 
low and high achievers in the 90-110 IQ range. High achievers in this 
IQ range had significantly higher scores than low achievers on CTP Sense 
of Personal Worth, CTP Sense of Personal Freedom, and CTP Community 
Relations. The investigators attributed the lack of significant differences 
in personality between low and high achievers in the 75-89 and 111-125 
IQ ranges to the relatively greater influence of differences in intelligence, 
as compared to nonintellective variables, in these ranges. 

Collectively, the findings of studies relating personality variables to 
mathematics attitudes and achievement indicate that individuals with more 
positive attitudes and higher achievement tend to have better personal and 
social adjustment than those with negative attitudes and low achievement. 
These results must be kept in perspective, however. The correlations are
relatively low, and it is a truism that correlation does not imply causation. 
Personal-social adjustment, attitudes, and achievement not only interact
with each other, but they are the effects of other home, school and
community variables. 


Teacher Characteristics, Attitudes, and Behavior 


It is generally held that teacher attitude and effectiveness in a particular
subject are important determinants of student attitudes and performance
in that subject. As an example of research bearing on this
supposition, Torrance et al. (1966) studied 127 sixth- through twelfth- 
grade mathematics teachers who participated in an experimental program 
to evaluate SMSG instructional materials. Pre- and posttests of educational 
and mathematical progress, aptitude, and attitude were administered to 
the teachers and their pupils. The result was that teacher effectiveness 
had a positive effect on student attitudes toward teachers, methods, and 
overall school climate. 

It is also true that students who do not do well in a subject may 
develop negative attitudes toward that subject and blame their teachers 
for their failures, even when the teachers have been conscientious. Thus, 
it is possible to interpret the findings of Aiken and Dreger (1961) as being 
due as much to “sour grapes” on the part of the students as to objective 


4For a comprehensive review of i

characteristics of their mathematics teachers. A result of this investigation
(Aiken & Dreger, 1961) was that college men who disliked mathematics, 
as contrasted with those who liked mathematics, stated that their previous 
mathematics teachers had been more impatient and hostile. College women 
who disliked mathematics, in contrast to those who liked mathematics, 
tended to view their previous mathematics teachers as more impatient,
not caring, grim, brutal, dull, severely lacking in knowledge of the subject, 
and not knowing anything about how to teach mathematics. In many of 
the correlational studies reviewed below there will be a similar problem of 
deciding which variable is cause and which effect, or, as was discussed 
above, whether the two variables form a mutually reinforcing system. 
Despite the difficulty of making clear interpretations, the results of these 
investigations may stimulate more controlled research on the topic. 


Interactions Between Teacher Attitudes and Student Attitudes 


Garner (1963) administered an inventory concerning attitudes toward 
algebra to 45 first-year algebra teachers and their 873 Anglo-American and 
290 Latin-American pupils in a Texas school system at the beginning and 
end of the school year. Beginning attitudes, judgments concerning the 
practical value of algebra, and algebra achievement were significantly 
higher in Anglo-American than in Latin-American pupils. Significant 
relations were found between: (1) teacher's background in mathematics 
and students’ achievement in algebra; (2) teacher’s attitude toward algebra 
and students’ attitudes; (3) teacher’s and students’ judgments concerning 
the practical value of algebra; (4) teacher’s attitude and changes in atti- 
tudes toward algebra in the Latin-American students. 


Peskin (1965) studied the relationship of teacher attitude and under- 
standing of seventh-grade mathematics to the attitudes and understanding 
of students in nine New York City junior high schools. Correlations were 
computed between the scores of teachers and students on six tests of attitude 
toward and understanding of arithmetic and geometry. The correlations 
between teachers’ and students’ understandings of algebra and geometry 
were significantly positive, as were the correlations between teachers’ understanding
scores and students’ attitudes. The relationships of teacher
understanding and attitude to student achievement and attitude were 
complex. For students having very high or very low levels of achievement, 
the correlations between teacher understanding and student achievement 
were significantly positive in the cases of both arithmetic and geometry. 
However, the correlation between teacher understanding and student atti- 
tude was significantly negative for the very high-level group in geometry. 
There was also an interaction between teacher attitude and understanding: 
teachers with a “middle” attitude and a “high” understanding had students
with the best scores in geometry, but teachers with “high” understanding
and “low” attitudes had students with the poorest achievement in arithmetic
and geometry.

Cross-Lagged Panel Correlation. These results pose again the
“chicken-egg” or cause-effect question referred to previously. In short, do
teacher attitudes and achievement affect student attitudes and achievement
or vice versa? Simple correlational analysis cannot answer this question,
but there is a correlational procedure which may give some information
on whether the pupil or the teacher has the greater effect on the other's
attitude and achievement. Campbell and Stanley (1963, pp. 68-70) discussed
such a design involving time as a third variable, which they referred
to as “cross-lagged panel correlation.” As an illustration of this approach,
suppose that an attitude scale is administered to a group of teachers and
their students at time 1 (pretest) and readministered at time 2 (posttest).
Then the correlation between teachers’ attitudes at time 1 and the means
of the attitude scores of their students at time 2 (r₁₂) is computed; the
correlation between teachers’ attitudes at time 2 and the means of the
attitude scores of their students at time 1 (r₂₁) is also computed. Then
if r₁₂ is significantly more positive than r₂₁, this is evidence that teachers’
initial attitudes had a greater effect on final (mean) student attitudes
than initial (mean) student attitudes had on final teacher attitudes. However,
if r₂₁ is significantly more positive than r₁₂, this is evidence that initial
(mean) student attitudes had a greater effect on final teacher attitudes
than initial teacher attitudes had on final (mean) student attitudes. A
similar approach can be used to study the effects of teacher attitudes or 
achievement on student achievement, or vice versa. The data collected by 
Garner (1963), in which teachers’ and students’ attitudes and achievement
were measured before and after some treatment time interval, lend themselves
to this sort of analysis.
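
The cross-lagged comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented, the function names (pearson_r, cross_lagged) are hypothetical, and r12 and r21 stand for the two lagged correlations (teacher attitude at time 1 with mean student attitude at time 2, and teacher attitude at time 2 with mean student attitude at time 1). The sketch computes the two coefficients only; it does not test whether their difference is statistically significant, which the procedure requires before any interpretation is made.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def cross_lagged(teacher_t1, teacher_t2, class_mean_t1, class_mean_t2):
    """Return (r12, r21) for one row per teacher.

    r12: teacher attitude at time 1 vs. mean student attitude at time 2.
    r21: teacher attitude at time 2 vs. mean student attitude at time 1.
    """
    r12 = pearson_r(teacher_t1, class_mean_t2)
    r21 = pearson_r(teacher_t2, class_mean_t1)
    return r12, r21


# Invented scores for six teachers and their classes (illustration only).
teacher_pre = [10, 14, 18, 22, 26, 30]
teacher_post = [12, 13, 20, 21, 27, 29]
class_pre = [20, 25, 19, 24, 22, 23]
class_post = [15, 17, 21, 24, 27, 31]

r12, r21 = cross_lagged(teacher_pre, teacher_post, class_pre, class_post)
print(f"r12 = {r12:.2f}, r21 = {r21:.2f}")
```

With these invented scores r12 is much larger than r21, the pattern that would be read as teachers' initial attitudes having the greater effect. With real data the two coefficients share the same teachers and classes, so a test for the difference between dependent correlations would be needed.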


Other data concerning the relationships between teacher characteristics 
and student attitudes were reported by Alpert et al. (1963). They found 
boys’ attitudes toward mathematics to be more positive when the teacher
was more theoretically-oriented and involved, regardless of the teacher's
sex. However, there was an interaction between the sex of the teacher
and the pupil in terms of the effects on student attitudes of more subjective,
interpersonal factors such as

Aptitude Test-Mathematical and an objective mathematics achievement test
administered at mid-semester. The students also took the Alpert-Haber 
Achievement Anxiety Test and an opinionnaire designed to assess students’ 
perceptions of the classroom characteristics of their teachers. This procedure 
was an extension of the McKeachie technique for obtaining measures of 
four types of motivating cues used by the teacher in the classroom: cues 
for achievement, affiliation, orderliness, and test and feedback. The data 
for the six student groups (male and female underachievers, average 
achievers, and overachievers) were analyzed by multiple discriminant 
analysis of the four measures of teacher motivating cues and two student 
anxiety variables. The results showed that, in general, girls were more 
sensitive than boys to the motive-arousing cues of their teachers, and girls 
were also significantly higher on debilitating anxiety. Girls in all three
achievement level groups perceived a lower number of teacher achievement 
cues than boys, and there were no significant differences among the three 
groups of girls on this variable. White and Aaron suggested that teacher- 
achievement cues were less effective with girls because the girls may already 
have been at an optimum level of achievement motivation. Other findings
were that high-achieving students seemed to be more perceptive of teacher 
cues emphasizing grades and success in mathematics, but underachievers 
perceived their teachers as less highly achievement-motivated. Under- 
achieving girls tended to perceive more affiliative, friendly, warm cues and 
fewer achievement cues from the teacher. Finally, girls in general tended 
to be more responsive to controlled, conforming behavior on the part of 
the teacher and to react more to extrinsic rewards and punishments from 
the teacher. 


Reasons for Liking or Disliking Arithmetic Among Teachers 
and Prospective Teachers 


Assuming that teacher attitudes can be communicated to students and 
can affect the attitudes and performance of students, it may be of interest 
to determine what percentage of elementary school teachers like or dislike 
arithmetic and what their reasons are. 

Stright (1960) concluded that a large percentage of elementary school 
teachers really enjoy teaching arithmetic and try to make it interesting. 
The teacher’s age, education, and experience apparently had little effect 
on her attitude toward teaching arithmetic. It is a reasonable observation, 
however, that the attitudes of elementary teachers toward mathematics are 
typically less positive than those of secondary school mathematics teachers 
(see Wilson et al., 1968, No. 9). 

During the past ten years, Dutton (1962, 1965) and others (e.g., Reys
& Delon, 1968) have conducted a number of studies concerned with the
attitudes of prospective elementary teachers toward mathematics. In a
survey at U.C.L.A., Dutton (1962) found that 38% of 127 elementary 
education majors had unfavorable attitudes toward arithmetic. More re- 
cently, Reys and Delon (1968) reported that only about 60% of the 385 
University of Missouri education majors whom they surveyed had favorable 
attitudes toward arithmetic. In Dutton’s (1962) study, those who disliked 
arithmetic gave reasons such as: word problems, boring work, long problems, 
and lack of understanding. Those with favorable attitudes pointed to 
aspects of arithmetic such as: useful, practical applications, definite, pre- 
cision of concepts, and fun just working with numbers. One shortcoming 
of Dutton’s (1962) study is that he attempted to draw conclusions about 
changes in attitudes over the years since an earlier survey with a different 
group of examinees. If one finds that a current sample of prospective 
teachers fills out an attitude inventory differently from an earlier sample, 
it could mean that attitudes have changed in the intervening years. But 
an equally plausible explanation is that the differences are caused by 
sampling errors. 
In a study quite similar to Dutton’s (1962), and suffering from some 
of the same limitations, Smith (1964) compared the attitudes of 123 
prospective teachers in the early 1960’s with those reported by Dutton 
for another group 10 years before. Among the reasons that Smith’s subjects 
gave for disliking arithmetic were: lack of understanding, written problems, 
poor teaching, failure, lack of teacher enthusiasm, too much long work, 
and afraid of arithmetic. In another survey of prospective elementary 
school teachers’ reasons for liking or disliking arithmetic (White, 1964), 
the most frequent reasons given for disliking the subject were: working 
word problems, difficulty with specific skills such as division, fractions, 
square roots, percentages, and the manner in which arithmetic was taught 
in elementary school. Prospective teachers indicating more favorable reac- 
tions to arithmetic were in the majority; they gave the following reasons 
for liking the subject: its challenge, its practical application, its exactness, 
appreciation of specific skills, and solving problems. 
The reasons given in these three studies (Dutton, 1962; Smith, 1964;
White, 1964) for disliking arithmetic are quite similar. Some are stimulus
variables—word problems, boring work, inadequate teachers, and some are 
organismic or response variables—failure to understand and fear. A good 
estimate is that these represent the reactions of approximately one-third of
prospective elementary-school teachers and perhaps of college students in
general (see Dreger & Aiken, 1957).


Relationships of Prospective Teachers’ Attitudes to Their Training 


Several investigators dealt with the relationship between the attitudes
and achievements of prospective teachers in teacher-training courses.
Unfortunately, the majority of these investigations employed experimental
designs that were inadequate for answering the questions that the investi- 
gators posed. The most popular designs—the one-group, pretest-posttest 
design and the static two groups comparison—suffer from somewhat dif- 
ferent, but equally telling failures of control (see Campbell & Stanley, 
1963). Therefore, the results of these investigations should be viewed as 
heuristic but not conclusive. 


An example of a pretest-posttest study without a control group is 
that of Reys and Delon (1968), in which the Dutton Attitude Scale was 
administered to 386 University of Missouri students before and after they 
took one of three courses in mathematics education. The researchers found 
a significant decrease from pre- to posttest in the percentage of students 
agreeing with the following statements on the attitude scale: “I avoid 
arithmetic because I am not very good with figures” and “I am afraid of 
doing word problems.” An increase was observed in the percentage of 
students agreeing with the statements: “Arithmetic is very interesting” and
“I like arithmetic because it is practical.” 


Dutton (1965) used a one-group design to assess changes in both 
attitudes and achievement resulting from intervening instruction. The 
subjects were 160 prospective elementary teachers, who were administered 
an arithmetic comprehension test and an attitude scale as pretests and 
posttests. Although the mean posttest score was significantly higher than the mean
pretest score on the arithmetic comprehension test, the rise in mean attitude-
scale score was not significant. Dutton noted that 25% of the prospective
teachers maintained their unfavorable attitudes toward arithmetic in spite
of the instruction.

A similar design was employed by Purcell (1965), who was concerned 
with the relationships of attitude change to increased understanding of 
arithmetic concepts and to grades in an elementary arithmetic methods 
course. Although pretest scores in understanding concepts were positively 
correlated with attitudes and with grades in the arithmetic methods course, 
change in understanding of concepts was not significantly related to change 
in attitude or course grade, and change in attitude was not related to course 
grade. However, there were significant improvements in understanding of 
concepts and in attitudes toward arithmetic. 

Gee (1966) gave pre- and posttests of basic mathematics understanding 
and attitudes toward mathematics to 186 prospective elementary teachers 
in a required mathematics course at Brigham Young University. The
following results were reported: (1) a significant improvement in attitudes
toward mathematics and a gain in basic understanding of mathematics by
the students while they were enrolled in the course, (2) a significant
correlation between pretest attitude and final grades, (3) nonsignificant
correlations between pretest attitude and change in understanding of
mathematics, and (4) a nonsignificant correlation between changes in
attitudes and changes in understanding of mathematics.


Attitudes and Training in Experienced Teachers 


To assess the relationship of amount of teachers’ training and ex- 
perience to their attitudes and understanding in arithmetic, Brown (1961) 
compared measures of attitudes and achievement in experienced and in- 
experienced teachers. His findings were that the experienced teachers had 
more positive attitudes toward arithmetic and a better understanding of 
basic arithmetic concepts, but no significant relationship was observed 
between the number of years of teaching experience and either attitude 
or understanding. 


Todd’s (1966) purpose was to evaluate the effects of a course, “Mathe- 
matics for Teachers,” which was taught in various locations throughout 
the state of Virginia in 1964, on attitude toward arithmetic and change 
in understanding of mathematics. He concluded that the course produced 
significant changes in attitudes toward arithmetic and arithmetic under-
standing for the teachers who completed it.*


Two-Group Designs 


In the two investigations summarized below, the researchers used a 
two-group design, which allows for more control over extraneous variables 
than the one-group design in determining the effects of particular treat- 
ments. However, in the studies reviewed the subjects were not assigned 
at random to the two groups; attempts were simply made to ascertain that
the two groups did not differ on variables which were not controlled in
the investigation.


*It may be appropriate to insert a note on gains or change scores at this point. Since
posttest minus pretest difference scores are correlated with pretest scores, the
initial level of ability, achievement, or attitude is not controlled when simple gain
scores are used. A procedure for eliminating the correlation between gains and
initial scores is to compute, as a measure of gains, the residual deviations of individuals’
actual posttest scores from their predicted posttest scores. The latter are estimated from
the regression equation for predicting posttest scores from pretest scores. However, one
need not actually compute residual gain scores in order to apply the concept.
Instead, the correlations of pretest and posttest scores with each other and with a
third variable are obtained, and the part correlation between posttest scores and the
third variable, with pretest scores partialed out of the former, is computed
(see Thorndike, 1963, pp. [...]).
[...] the situation in which each is appropriate.
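
The residual-gain procedure described in the note can be sketched numerically. The example below is an illustration under stated assumptions rather than Thorndike's own computation: the data are simulated, the variable `third` stands for any hypothetical third measure, and the helper names (`pearson`, `residual_gains`, `part_correlation`) are introduced here. It shows that simple gains correlate with pretest scores, that residual gains do not, and that the part correlation computed from the three pairwise correlations equals the correlation of residual gains with the third variable.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residual_gains(pre, post):
    """Actual posttest score minus the posttest score predicted from pretest."""
    mx, my = mean(pre), mean(post)
    slope = (sum((x - mx) * (y - my) for x, y in zip(pre, post))
             / sum((x - mx) ** 2 for x in pre))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


def part_correlation(r_post_third, r_pre_third, r_pre_post):
    """Correlation of the third variable with posttest, pretest partialed out of posttest."""
    return (r_post_third - r_pre_third * r_pre_post) / math.sqrt(1 - r_pre_post ** 2)


# Simulated (hypothetical) scores for 200 examinees.
rng = random.Random(0)
pre = [rng.gauss(50, 10) for _ in range(200)]
post = [25 + 0.6 * x + rng.gauss(0, 8) for x in pre]
third = [0.2 * y + rng.gauss(0, 3) for y in post]

simple = [y - x for x, y in zip(pre, post)]
resid = residual_gains(pre, post)

r_simple_pre = pearson(simple, pre)      # negative: simple gains depend on initial level
r_resid_pre = pearson(resid, pre)        # zero apart from rounding error
r_resid_third = pearson(resid, third)
r_part = part_correlation(pearson(post, third), pearson(pre, third), pearson(pre, post))

print(f"simple gain vs. pretest:   {r_simple_pre:.3f}")
print(f"residual gain vs. pretest: {r_resid_pre:.3f}")
print(f"residual gain vs. third:   {r_resid_third:.3f}  (part correlation: {r_part:.3f})")
```

The last two printed values agree, which is the sense in which one need not actually compute residual gain scores in order to apply the concept.
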

Rice (1965) was interested in determining whether formal instruction
in modern mathematics influences teacher attitudes toward modern mathe- 
matics in particular and mathematics in general. He mailed questionnaires 
concerning experiences with modern mathematics and attitude toward 
mathematics to a large number of elementary teachers in Oklahoma. 
Four hundred of the 608 replies were analyzed by analysis of variance, 
chi square, and other statistical procedures. From the results, Rice con- 
cluded that teachers who had formal instruction in modern mathematics 
have more favorable attitudes toward modern mathematics and toward 
mathematics in general than do teachers who have had no such training. 
Among the teachers who reported that they had training in modern 
mathematics, there was a significant difference in attitudes toward modern 
mathematics in favor of those who had taught in a modern program. 
Attitudes toward modern mathematics were also more favorable among 
those who had more training in modern mathematics and among those 
with more than four years of college. Finally, attitudes toward modern 
mathematics were found to be unrelated to age, experience, and sex. 

In the strictest sense, Rice’s (1965) investigation is a correlational 
study rather than an experiment. Somewhat more “experimental” in nature
is the investigation by Wickes (1968), who wished to determine the effects 
of two different arrangements (pre-requisite vs. consolidated) of courses 
concerned with concepts in elementary-school mathematics on prospective 
teachers’ attitudes and understanding of mathematics. In one arrangement, 
the completion of a specially designed mathematics course was prerequisite 
to enrollment in a course in methods of teaching elementary mathematics. 
A second arrangement was a single consolidated course in which concepts
and methodology were interrelated. The “pre-requisite” group consisted
of 65 students at Baylor University who had taken the first course
arrangement in two preceding years. The “consolidated” group, which was
comparable to the first group on pretest measures of attitudes and achievement,
was composed of 104 students who completed the consolidated
course. Pre- and posttest scores on an attitude scale and a
mathematics concepts test were available for both groups, and it was verified
statistically that the two groups were comparable in their pretest scores
on these variables. The results showed that both course arrangements
produced statistically significant gains in mathematics attitudes and under-
standing of fundamental mathematics concepts. The pre-requisite group
showed significantly greater gains in understanding of mathematics
concepts, but the two groups did not differ in gains on the attitude scale.
Wickes concluded that, all things considered, the two-course sequence
was more effective than the consolidated course.

The results of the investigations reviewed above indicate
that course work in mathematics can affect the attitudes
and achievement of teachers and teacher-trainees. But what has recent
research revealed about the effects of instructional method and curriculum 
on the attitudes and mathematics achievement of students in the public 
schools? 


Instructional Method and Curriculum 
Rote Memory vs. Meaningful Teaching 


In a discussion of a variety of unpleasant experiences in the earlier 
grades that cause students to avoid high-school mathematics, Wilson (1961) 
concluded that a primary cause is “drill beyond the fundamental processes.” 
Bernstein (1964) apparently concurred with Wilson’s conclusion that 
mathematicians and teachers are almost universally agreed that rote learn- 
ing procedures are a major factor in producing negative attitudes toward 
mathematics. Collier (1959) also maintained that teachers should empha- 
size computational speed less and place more stress on developing mathe- 
matical understanding and logical reasoning ability. 


Clark (1961) suggested that reliance on rote memory rather than 
logical reasoning is a consequence of the assignment of formal arithmetic 
at too early a grade. In his opinion: “Children are often confronted in 
school with situations which few adults would tolerate. Day in and day 
out there is repetition of meaningless expressions, terms, and symbols. 
Eventually, many children come to dislike arithmetic. Lack of under- 
standing and skills is associated with personality maladjustment and delin- 
quent behavior, including truancy and incorrigibility [p. 2].” 

In a study of fourth-grade pupils in a Georgia school, Lyda and 
Morse (1963) noted positive changes in attitudes toward arithmetic and
significant gains in arithmetic computation and reasoning when a “mean-
ingful method” of teaching the subject was employed. Their method
emphasized the mathematical aim of arithmetic: stressing the concept of 
number, understanding of the numeration system, place value, the use of 
fundamental operations, the rationale of computational forms, and the 
relationships which make arithmetic a system of thinking. 


Another way that has been suggested for making arithmetic more 
meaningful, or at least more interesting, is televised instruction. Kaprelian 
(1961) administered a questionnaire to 65 fourth-grade pupils to obtain 
their reactions to the television program “Patterns in Arithmetic.” Over 
90% of the pupils approved of the program to some extent, and over 75% 
said that they liked arithmetic better after viewing the new arithmetic 
television program. Finally, 75% of the pupils stated that their attitudes
toward arithmetic had changed because the television program helped
them to understand the subject.


Effects of Ability Grouping


Grouping pupils in arithmetic classes according to their abilities has 
frequently been criticized as leading to poor attitudes, either directly or as 
a result of parental attitudes toward grouping. To study the effects of 
ability grouping on attitudes, Lerch (1961) compared the change in atti- 
tudes toward arithmetic of fourth-grade pupils taught intermittently in 
ability groups with the changes in attitudes of pupils taught in traditional, 
non-grouped classes. Differences in scores on the pre- and posttest attitude 
inventories showed that more than half of the pupils in both groups became 
more favorable in their attitudes toward arithmetic. The average change 
in attitude of the ability-grouped classes, however, was not significantly 
different from that of the nongrouped classes. It was concluded that 
children’s attitudes toward arithmetic are less dependent upon classroom 
organization than on their teachers’ attitudes and the methods which the 
teachers employ. 


In another study of the effects of ability grouping, Davis and Tracy 
(1963) compared the pre- and posttest scores on the California Arithmetic 
Test of 393 North Carolina fourth-, fifth-, and sixth-graders. The two
types of programs were a Joplin-type plan (ability grouping) and a random
plan (nonability grouping). It was ascertained that initially the two
groups did not differ significantly in their scores on measures of ability,
self-concept, anxiety, and attitudes toward arithmetic, which were ad- 
ministered as pretests. Thus, attitude toward arithmetic was a concomitant 
variable, rather than a criterion variable, in this study. The results were 
that pupils in the Joplin-type plan did not gain significantly more in 
arithmetic achievement than pupils in the random plan. Consistent with 
the conclusion of Lerch (1961) referred to above, Davis and Tracy (1963) 
concluded that differences among teachers in their knowledge of arithmetic,
their attitudes toward arithmetic, and variability in their method of teach- 
ing—factors which were not controlled or measured in this study—are 
important variables to consider in future research on ability grouping. 


School Mathematics Study Group (SMSG) Curriculum 


In a discussion of motivations in mathematics, Bernstein (1964) sug- 
gested that organization of subject matter, such as that in SMSG, may 
improve attitudes toward mathematics. Unfortunately, studies have typically 
failed to verify Bernstein’s suggestion: the teacher, rather than the curri- 
culum, still appears to be the more influential variable as far as attitudes 
are concerned. For example, in a comparison of SMSG and non-SMSG 
seventh-grade classes, Alpert et al. (1963) observed that the SMSG cur- 
riculum did not increase students’ positive feelings toward mathematics, 
either absolutely or when compared with the non-SMSG curriculum.
However, teachers with a highly theoretical orientation tended to produce more
positive feelings in SMSG classes, but not in non-SMSG classes. On other
measures of attitudes, Alpert et al. (1963) found that the attitudes toward
mathematics of SMSG students became less positive from fall to spring
testing, whereas the attitudes of non-SMSG students remained relatively
constant.

Similar results have been obtained by other investigators who have 
compared SMSG and traditional curricula in elementary and junior high 
school (Hungerman, 1967; Osborn, 1965; Phelps, 1964; Woodall, 1967). 
In general, in these studies the investigators found that the mean mathe-
matics attitude score of students taught by the SMSG curriculum was not
significantly greater than (and even more negative in some reports, e.g.,
Osborn, 1965) the mean attitude score of students taught mathematics by
the traditional curriculum. For achievement, in one study (Osborn, 1965)
results favoring the SMSG curriculum were found, in another study results
favored the traditional curriculum (Hungerman, 1967), and in still another 
no significant difference was found between the two types of program 
(Woodall, 1967). In general, scores on conventional standardized tests of 
achievement in mathematics tended to favor the traditional, non-SMSG 
curriculum, but scores on more specialized tests such as those constructed 
by the School Mathematics Study Group for use with its materials tended 
to favor the SMSG curriculum. 


One wonders why the SMSG curriculum fails to produce more positive 
attitudes toward mathematics, as Bernstein (1964) hoped that it would. 
One suggested explanation (Osborn, 1965) is that the SMSG curriculum 
is more abstract and demanding than the traditional curriculum, which 
causes students’ attitudes to fail to change at all or to even become more 
negative as the length of time that they study the SMSG program increases. 

Before one goes too far in interpreting the above results, however, it 
should be emphasized that in these investigations the available subjects 
were not assigned at random to the two types of curricula. The investigators
merely analyzed data obtained from existing groups. In some cases
the investigators attempted to assure themselves that the groups did not
differ significantly in their pretest scores; in other cases the investigators
used analysis of covariance in an attempt to control for initial group
differences. But without random assignment of subjects to conditions, there
is little control over extraneous variables. As Elashoff (1969) pointed out
in her discussion of analysis of covariance, a crucial assumption underlying
this statistical method is that subjects have been assigned at random to
treatments. Therefore, many of the conclusions of the studies reviewed
above must be viewed as tentative until more controlled research is done.
In addition, it would be advisable in change studies such as those described
above to examine pretest and posttest scores for individuals as well as mean
[...]


Other Modern Mathematics Programs


Correlational data which show that students in certain special public
school programs have more positive attitudes toward the subject than
students [...]
to or selected for the program because of their attitudes [...]
the subject. A case in point is the finding of Ellingson (1962) [...]
attitudes toward mathematics than students in terminal or [...]
mathematics classes. The self-selection factor, where those with [...]
attitudes and higher ability elect the special course, and the [...]
morale of being in a “new” program can obviously influence the [...]
investigations in which there is no true control [...]

Research designs similar to those of studies [...]
section on the SMSG curriculum have been employed to [...]
mathematics programs with the traditional program. For example, [...]
(1967) compared the college mathematics achievement [...]
students who had the University of Illinois Committee on School Mathe-
matics (UICSM) program in public school with those of students who had
[...]
total number of semesters of college mathematics taken, major field of study,
number of elective mathematics courses taken, types of mathematics courses 
taken, and over-all mathematics grade-point averages were obtained from 
transcripts and questionnaires. There were few differences between the 
two groups in college mathematics achievement after the criterion scores 
of the UICSM and non-UICSM groups were adjusted by covariance analy- 
sis on numerical and verbal aptitudes, high-school grade averages, school 
size, and percentage of each school’s students who went to college. The 
UICSM group did take significantly more college mathematics, however, 
and did as well as the non-UICSM students. In addition, the UICSM 
students had significantly more favorable mathematics attitudes than the 
non-UICSM group. 

In an investigation by Yasui (1968), a modern-mathematics group 
studied the Secondary School Mathematics textbook series in grades 10, 11, 
and 12 in Edmonton, Alberta public schools. A control (traditional) 
group consisted of 125 students selected from high schools not exposed to
modern mathematics. After it was “adjusted” for individual differences in
scholastic ability with ninth-grade scores on the School and College Ability 
Test, the mean score of the modern-mathematics group was significantly 
higher than that of the traditional group on test items of the Contemporary 
Mathematics Test which contained material common to both curricula. Al- 
though the difference between the mean scores of the two groups on an 
inventory of attitudes toward mathematics was not significant, attitude 
scores were significantly correlated with achievement in both groups. It 
may be observed that the two groups in this study were not equated for 
initial attitude toward mathematics; therefore, it is not certain what the 
failure to find a significant difference in mean attitude in the twelfth grade 
indicates. Perhaps the two groups had equivalent mean attitude scores to 
begin with and became more positive, more negative, or remained the 
same; or perhaps one group had more positive attitudes than the other at 
the outset, and the initially more positive group became more negative, or 
the initially more negative group became more positive. 

The aim of Ryan’s (1967) project, which involved 126 pairs of mathe- 
matics classes in schools distributed throughout a five-state area, was to 
compare the effects of three experimental “modern” programs in secondary 
mathematics—the Ball State, UICSM, and SMSG programs—on the atti- 
tudes and interests developed in ninth-grade pupils. Self-report measures 
of attitudes and interests were administered to the students at the begin-
ning and end of the school year, and systematic observations of behavioral
signs of student interest were made. Pupil characteristics such as sex and
achievement level, and teacher characteristics such as experience with the
programs were considered in the data analysis. The general finding was
that the experimental programs, when compared with the conventional 
mathematics programs, had little differential effect on the attitudes and 
interests of the pupils. There was a slight tendency, however, for the Ball 
State program to be related to the development of less positive attitudes and 
the UICSM program with more positive attitudes toward mathematics, 
when compared to conventional programs. The less positive attitude of the 
students using the Ball State program was associated with the reported 
greater difficulty which they had in understanding their materials. Meas- 
ured pupil and teacher characteristics did not interact significantly with 
type of program in determining its influence, but change in attitude was 
generally related to change in grade received relative to the previous year 
and to the degree of difficulty which pupils experienced with the materials. 


Other Curriculum Comparisons 


Especially noteworthy for its attempt to control extraneous variables
is an investigation by Devine (1967), who compared program-centered
with teacher-centered teaching of first-year algebra. In each of two high
schools two classes (an experimental and a control class) were selected
at random, but subjects were not assigned at random to the classes. 
Achievement and attitude tests were administered at various times during 
the school year. A particularly interesting result of the study was the 
obtained interaction between teacher experience and type of curriculum 
in their effects on student achievement in mathematics. When the teacher 
was experienced the mathematics achievement of the program-centered 
group was lower than that of the teacher-centered group; there was no 
change in either group in attitude toward mathematics or toward pro- 
grammed materials. When the teacher was inexperienced the program- 
centered group achieved as well as the teacher-centered group, but the 
attitudes toward mathematics and toward programmed instruction became 
more negative in both groups. In a summary of the results, Devine (1967) 
concluded that when an average or above average teacher is available, 
greater achievement is obtained in a conventional, teacher-centered class- 
room approach. 


A curriculum investigation by Maertens (1968) may be cited as much 
as an illustration of a controlled experiment as for its specific results. The 
experiment was designed to assess the differential effects of the curriculum 
practice of assigning homework in arithmetic on the attitudes of third-grade 
pupils toward school, teacher, arithmetic, homework, spelling and reading. 
There were three treatments: control (no homework), common practice 
(regular teacher assigns homework), and experimenter-prepared homework. 
Pupils were randomly assigned to three classrooms within each of four 
schools, and within each classroom pupils were classified into three levels 
according to intellectual ability. The data from five subjects in each of the
three ability groups within each of the 12 classrooms were analyzed by
[...]
statistically significant differences among the three treatments, Maertens
concluded that arithmetic homework does not uniformly affect pupils’
attitudes toward arithmetic and the other five sources referred to above.
Consequently, teachers need not omit purposeful arithmetic homework as
a general practice because of fear that it may create negative pupil attitudes.


Developing Positive Attitudes and Modifying Negative Attitudes 


Alpert et al. (1963), on the basis of the results of their research on
the SMSG program, made the following suggestions for improving students’
attitudes toward mathematics: (1) more attention by
textbook writers to those aspects of school which affect psychological
determiners of success in mathematics; (2) more attention to teacher
selection and training and to the possibility of taking into account teacher
characteristics when grouping pupils; (3) consideration given in course 
design to the meaning of education in mathematics for women; and (4) 
communication to parents about the nature of the effects which they have 
on children’s mathematics education. These are commendable goals on a 
broad scale. Bassham, Murphy and Murphy (1964) noted that to change 
a pupil’s attitude toward mathematics, his perception of himself in relation 
to mathematical materials must be changed. Therefore, what have other 
mathematics educators and researchers recommended and accomplished in 
their efforts to change students’ perceptions of themselves in relation to 
mathematics? 


Emphasis on Relevance, Meaningfulness, and Games 


Since the time of John Dewey there has been a growing emphasis 
on the need to make education practical and relevant. Nevertheless, 
Bernstein (1964) argued that educators have failed to stress sufficiently 
the use of mathematics for studying and controlling our physical and social 
environment. In one of the reports of the Ohio State University Develop- 
ment Fund, Nathan Lazar maintained that children can learn to like 
arithmetic if they are not bombarded with meaningless memory drills. 
His approach to helping children understand arithmetic is to use simple 
reasoning problems about black horses, brown cows, and white sheep. 
In addition, he invented a game-like apparatus called an “abacounter”— 
a variation on an abacus with multi-colored beads strung on rods—to help 
make mathematics more enjoyable and meaningful. Tulock (1957) also 
recommended that games, contests, and audio-visual aids be used to 
heighten interest in mathematics. 

Zschocher (1965) experimentally investigated the effectiveness of
various group mathematical games on the performance of first-grade 
children in day care centers in Germany. For five months, 70 girls and 
boys were given an opportunity to play the games before and after classes. 
The results were that the children’s scores on standard tests of number 
concepts, spatial orientation, and basic arithmetic rose significantly. A 
control group of 75 subjects showed no significant improvement on the 
tests. However, teachers did not discriminate significantly between the 
experimental and control children in their evaluations of the children’s 
mathematical achievement. In a related study of older children, Jones
(1968) obtained a significant improvement in the attitudes of ninth-grade
students in remedial classes when they were taught mathematics by
modified programed lectures and mathematical games.


Providing for Success Experiences 


Many writers (e.g., Lerch, 1961; Tulock, 1957) have observed that
pupils who consistently fail in mathematics lose self-confidence and develop
feelings of dislike and hostility toward the subject. To cope with such
negative attitudes, the teacher must provide success experiences for the
learner: the child should be taught to set reasonable goals that culminate
[...]
also referred to by Proctor (1965) in a discussion of techniques for giving
self-confidence to slow learners and thus changing their attitudes toward 
mathematics. 


An Experiment on Mediated Transfer 


It is not particularly surprising that success in mathematics, which is 
a pleasant experience, can cause a person’s attitude toward the subject 
to become more positive. Although it is not always possible for an individual 
to succeed in mathematics, the results of an experiment by Natkin (1966) 
suggest that simply getting an individual to associate mathematics with 
something pleasant may improve his attitude or make him less anxious 
with respect to the subject. 

The initial group of subjects selected by Natkin were male and female 
undergraduates who scored above the mean on the verbal section and one 
standard deviation below the mean on the mathematical section of the
[...]
(GSRs) of the subjects to math and nonmath stimuli. Then those subjects
whose GSRs to the math stimuli were significantly greater than their
[...]
control groups. In the first stage of the experiment proper, the subjects
in the experimental group learned, by a paired associates procedure, to 
associate the mathematics stimuli with nonsense syllables; in the second 
stage they learned to associate the same nonsense syllables with strongly 
pleasant phrases. The subjects in the control group learned the same 
math stimuli-nonsense syllable pairs in the first stage as the experimental 
group, but the former learned nonsense syllable-neutral stimuli associations 
in the second stage. As Natkin predicted, scores on & test of anxiety 
toward mathematics showed a more significant decrease from pre- to post- 
experimental testing in the experimental group than in the control group. 
The post-experimental test of anxiety was administered only five ae 
after the learning session, however, and one might well question the 
permanence of the decrease. Other questions which need to be answered 
are whether the anxiety change observed by Natkin (1966) would have 
generalized to other situations (eg: school tests) involving mathematics 
and whether his “mediated transfer” procedure can also affect performance 
in mathematics. Nevertheless, Natkin concluded that the experimental 
procedure created a mediated “therapy” effect on mathematics anxiety, 
quite similar to the desensitization of fears by behavior therapy. He also 
noted from the response patterns in the test data that early traumatic 
learning was largely responsible for anxiety toward mathematics. 
Natkin’s (1966) experiment is important because it showed, by means 
of a well-controlled experiment, that it is possible to affect anxiety toward 
mathematics, if only for a short time. There are behavior therapy tech- 
niques other than “mediated transfer” that should certainly be explored 
as possible methods for reducing mathematics anxiety. Finding out which 
techniques are most effective will involve more research; consequently this 
brings us to the closing section on evaluation of previous research and 
suggestions for further research on attitudes toward mathematics. 


Critique of Previous Studies and Suggestions for Further Research 
Criticisms 


A number of critical comments about previous research concerned 
with the determiners and effects of attitudes toward mathematics have 
already been made in this review. Some of these criticisms apply equally 
well to other areas of educational research, and they have been widely 
recognized. In general, there has been too much reliance on correlational 
methods and on indirect measures of behavior, such as questionnaires and 
other student-reports. It is admittedly easier to point to a need than to 
satisfy one, but the correlational results which have been reported need 
to be supplemented by controlled experiments to test the hypotheses 
suggested by the significant correlation coefficients. As indicated previously, 
the application of analysis of covariance is questionable unless an investi- 
gator can satisfy the assumption of random assignment of subjects to treat- 
ments, independence of covariate and treatments, and no treatment-slope 
interaction. Certainly the procedures of “matching” and “statistical control 
of concomitant variables” should not be viewed as substitutes for random 
assignment of subjects to treatment conditions in the analysis of covariance. 
And more appropriate than analysis of covariance in most educational 
investigations is a randomized blocks design in which the various blocks 
represent levels of the pretest variable and individuals within the same 
block are assigned at random to treatments. 
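
The randomized blocks design described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `randomized_blocks`, the subject labels, and the pretest scores are hypothetical and do not come from any study reviewed here. Subjects are ranked on the pretest, cut into blocks as large as the number of treatments, and assigned at random within each block.

```python
import random


def randomized_blocks(pretest_scores, treatments, seed=0):
    """Assign subjects to treatments at random within pretest blocks.

    pretest_scores: dict mapping a subject label to a pretest score.
    treatments: list of treatment labels.
    Returns a dict mapping each subject to (block number, treatment).
    """
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    ranked = sorted(pretest_scores, key=pretest_scores.get)
    k = len(treatments)
    assignment = {}
    for start in range(0, len(ranked), k):
        block = ranked[start:start + k]   # adjacent pretest ranks form a block
        order = list(treatments)
        rng.shuffle(order)                # random order of treatments in this block
        for subject, treatment in zip(block, order):
            assignment[subject] = (start // k, treatment)
    return assignment


# Hypothetical pretest scores for six pupils.
scores = {"s1": 12, "s2": 30, "s3": 18, "s4": 25, "s5": 9, "s6": 21}
plan = randomized_blocks(scores, ["experimental", "control"])
```

Each complete block then contains every treatment exactly once, so treatment groups are balanced on the pretest without relying on statistical adjustment after the fact.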

A general discussion of research methods in education is given in the 
treatise by Campbell and Stanley (1963), who describe in detail the sources 
of error left uncontrolled in various research designs. Their proposal for 
obtaining information concerning cause-effect relations through correlations 
across time (cross-lagged panel correlations) would appear to be a poten- 
tially fruitful approach to an analysis of the direction of cause and effect 
in studies of teacher and pupil attitudes and achievement. Finally, whenever 
correlation coefficients or other statistics are to be computed on 
differences (changes, gains) in attitudes, achievement, or other variables, 
the investigator should first become cognizant of the methodological issues 
involved in the use of these kinds of scores (see Thorndike, 1963; Cronbach 
& Furby, 1970). 
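
As a rough illustration of the cross-lagged panel idea, the sketch below computes the two lagged correlations for a pair of variables measured on two occasions. The function names and the data layout are assumptions made for the example, not part of the Campbell and Stanley proposal, and a difference between the two coefficients is at most suggestive of causal direction.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cross_lagged(attitude_t1, achievement_t1, attitude_t2, achievement_t2):
    """Return the two cross-lagged correlations for one panel of pupils.

    Each argument is a list with one score per pupil, in the same pupil order.
    """
    return {
        "attitude_t1_with_achievement_t2": pearson(attitude_t1, achievement_t2),
        "achievement_t1_with_attitude_t2": pearson(achievement_t1, attitude_t2),
    }
```

If the first coefficient is clearly larger than the second, the pattern is more consistent with attitude preceding achievement than the reverse, although the comparison also depends on the reliability and stability of each measure.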

The remainder of the review will deal with some suggestions for 
further research on attitudes toward mathematics—research which it is 
hoped will take into account both the findings and shortcomings of the 
work that has been reviewed in preceding sections of this paper. 


Measures of Attitudes 


Since the usefulness of the results of research is frequently limited by 
the preciseness with which outcomes are measured, something needs to be 
done to improve the accuracy of measures of attitudes. The task may be 
approached in several ways. Anttonen (1968), for example, has pointed 
to the need for research aimed toward improving the readability of attitude 
measurements at the elementary-school level. In addition, I feel that the 
concept of a general attitude toward mathematics should be supplemented 
with that of attitudes toward more specific aspects of mathematics, e.g., 
problem-solving and routine drill. This is similar to the recommendation 
made by Moss and Kagan (1961) with respect to the concept of achieve- 
ment. 

One possible approach to designing such multivariate attitude instruments 
is based on a stimulus-response model like that proposed by Aiken 
(1962) and by Endler, Hunt and Rosenstein (1962) for the concept of 
anxiety. Such instruments should be of greater diagnostic value than 
the current scales of general attitudes toward mathematics with their single, 
over-all score. The stimulus-response approach could also consider the 


other school subjects. Studies of the relationships of the factor structure 
of these attitude measures to age, sex, and other organismic variables, 
similar to Very’s (1967) investigations of mathematics abilities, would 
also be of interest. 


Teachers 


Although it is certainly unfair to indict teachers too strongly as 
creators of dete student attitudes toward mathematics, the ay Me 
research have suggested that the teacher, perhaps even more than the 


parents, is an important determiner of student attitudes. Banks (1964, 
pp. 16-17) wrote: 
An unhealthy attitude toward arithmetic may result from a num- 
ber of causes. Parental attitude may be responsible. . . , 

failure is almost certain to produce a bad emotional reaction to the 
study of arithmetic. Attitude of his peers will have their effects 
upon the child's attitude. But by far the most significant con- 
tributing factor is the attitude of the teacher. The teacher who 
feels insecure, who dreads and dislikes the subject, for whom 
arithmetic is largely rote manipulation, devoid of understanding, 
cannot avoid transmitting her feelings to the children, ... On 
the other hand, the teacher who has confidence, understanding, 
interest, and enthusiasm for arithmetic has gone a long way 
toward insuring success. 


To provide further information on the effects of teacher attitudes, 
more measures of teacher attitudes and their consequences (e.g., by 
classroom observation) should be obtained. Student reports of perceived 
teacher attitudes and teacher reports of their own attitudes are useful, 
but direct observation of teacher-pupil interaction in mathematics classes 
is also needed. In addition, more attention should be paid to the mathe- 
matics training of elementary-school teachers. If the law of primacy holds, 
the influence of elementary teachers on pupil attitudes should be even 
greater than that of secondary teachers. 

Finally, it would be interesting to conduct a Rosenthal-type investi- 
gation to determine the effects of teacher expectations in mathematics on 
student attitudes and achievement. Using the procedure of Rosenthal and 
Jacobson (1968), after the students are tested initially the teachers would 
be informed that one group of children—actually selected at random— 
will show an increase in mathematics achievement and positive attitudes 
toward mathematics during the following semester. Measures of change 
in student achievement and attitudes in both the experimental and control 
groups would be assessed to determine the effects of these variables of 
teachers’ expectations.’ 


Longitudinal, Multivariate, and Experimental Studies 


Alpert et al. (1963) and Anttonen (1968) have pointed to the need 
for longitudinal research on patterns of performance in mathematics 
emerging over time and on psychological variables related to these changes. 
Alpert et al. (1963) called for further classroom experimentation on the 
use of self-instructional programs, modern mathematics programs, specially 


“The merhodological shortcomings of the Rosenthal and Jacobson (1968) study con- 
pier p: Gask expectations on the scores of thee pile on an in py 
theless, it is reasonable to certainly be emphasized (see Thorndike, ged a 
trained teachers, use of innovative teaching techniques, and training films. 
Anttonen (1968) maintained that 

achievement in such studies should 
the six-year span which ; 

NLSMA studies, would appear to be most satisfactory. However, there 
is a need for both longitudinal and . 
effects and interactions of many variables—teachers, parents, curriculum, 
and such pupil variables as general and special abilities, biographical fac- 
tors, interests, and personality characteristics—on 

(see Cattell & Butcher, 1968). 

The implication of much of what has been said previously in this 
review is that multivariate research should not be limited to correlational 
designs. Weaver and. G&G CIES) SAINE DAMN OA tee Boii 
development of mathematical ideas and abilities children exposed 
to different instructional conditions and in different environ- 
ments. They maintain that since the personality characteristics of children, 
instructional methods and materials, school organization, motivating condi- 
tions, and level and sequence of mathematical content interact to such a 
degree, ec ey Tran ies me ( 
of these variables are necessary. This type of requires 
of the correlational and experimental to research. 


Development and Modification of Attitudes 


(1964) pointed to the desirability of further experimental work to explore 
the development and modification of anxiety, attitudes, and other variables 
which affect achievement in mathematics. 


It is clear that serious thought must be given to experiments concerned 


Summary 


More than three dozen journal articles, two dozen doctoral dissertations, 
and a half-dozen reports of studies concerned with attitudes toward 
mathematics which have been written during the past decade were 
reviewed. The major topics covered were: methods of measuring attitudes 
toward arithmetic and mathematics; the distribution and stability of mathe- 
matics attitudes; the effects of attitudes on achievement in mathematics; 
the relationships of mathematics attitudes to ability and personality factors, 
to parental attitudes and expectations, to peer attitudes, and to teacher 
characteristics, attitudes and behavior. Also discussed were investigations 
dealing with the effects of modern mathematics curricula and other cur- 
riculum practices on attitudes. Of all the factors affecting student attitudes 
toward mathematics, teacher attitudes are viewed as being of particular 
importance. Finally, research concerned with techniques for modifying 
negative attitudes and developing positive attitudes was summarized. 


Among the criticisms made of research on attitudes toward mathematics 
were the use of crude measures of attitudes, excessive reliance on correla- 
tional methods, improper use of covariance analysis, inadequate control of 
extraneous variables, and failure to use adequate measures of change. 
Suggestions for further research included adequate familiarization with 
previous studies concerned with the topic, the development of multifaceted 
measures of attitudes, more extensive multivariate experiments extending 
over longer periods of time, and more attention to techniques for developing 
positive attitudes and modifying negative attitudes. 
REVIEW OF RESEARCH INVOLVING APPLIED 
BEHAVIOR ANALYSIS IN THE CLASSROOM 


Edward M. Hanley¹ 


University of Vermont 


If one describes a study involving an applied analysis of behavior 
to a teacher, she will probably remark, “Why, what is new about that? 
I've been doing those things in my classroom for years!” If the teacher were 
asked to describe what she does in her classroom during a day, a week, or a 
year, one might even get a general description of her behavior that would 
enable a listener to classify her work as a form of applied behavior analysis. 
The description that she gave might loosely fit some of the dimensions of 
an applied behavior analysis as discussed by Baer, Wolf and Risley (1968). 
These dimensions involve an evaluation of applied research as it relates 
to the criteria of being applied, behavioral, analytic, technological, 
conceptually systematic, effective, and general. 


of subject matter. She may also have quantified behaviorally her techniques 
by scores on tests, grades, etc. The teacher might have been analytic if, when 
she tried a specific technique and her students improved, the technique was 
then discontinued to see if the improvement deteriorated. Our teacher might 
also be technological, if other teachers can replicate her techniques. The 
teacher in question may to some extent relate her procedures to a conceptual 
system of basic behavioral concepts. Most teachers consider their techniques 
effective because the majority of their students leave the classroom at the 
end of the year with better academic skills than when they entered the 
classroom. Finally, most teachers would state that there is generality in 
the various behaviors that they effect. For example, the subtraction principles 
learned at the beginning of the year can be utilized in solving more complex 
problems such as long division. 

If one were to analyze critically the teacher’s statements and her actual 
behavior in the classroom, there would be only a slight correspondence 
between the criteria mentioned above and the teacher’s behavior. The lack 
of specific correspondence is explained in an article on behavioral modifica- 
tion in education (Homme & Tosti, p. 4). 


¹ Dr. Hanley gratefully acknowledges the helpful suggestions of Montrose Wolf, Donald 
Baer, Todd Risley, and R. Vance Hall. 
It is not the lack of reinforcement which makes behavioral control 
difficult. One might reason, it might be the lack of knowledge of 
the principles of behavioral control. If one attempts to verify this 
he will find that this too is incorrect. If given a test on the principles 
so far discussed, most people would score very high. It is not a 
lack of reinforcement or lack of knowledge about how to use them. 
The difficulty can be primarily traced to the failure to system- 
atically apply what is known. It is not only that operant principles 
are not systematically applied, they are if applied at all, sporadi- 
cally applied. 


Unsystematic application precludes most teachers from qualifying their 
behavior in the classroom as an applied behavior analysis. In this paper, 
studies categorized as applied behavior analyses are evaluated on the 
criteria cited by Baer et al. (1968) to indicate what is actually done when 
an applied behavior study is carried out, and to show how the teacher may 
better approximate the criteria involved in an applied behavior analysis. 
In a strict interpretation of the word, most people would not classify 
the teacher as a “researcher,” but the teacher does use fairly specific 
techniques in attempting to bring about changes in the behavior of children. 
By adopting a more research-oriented approach to the problems of the 
classroom, the teacher could be in a better position to evaluate more fully 
the procedures that she uses in attempting to change the behavior of her 
students. 


In recent years there has been a greater emphasis on the systematic 
use of reinforcement procedures involved in the analysis of classroom 
behaviors. These principles are extensions and applications of earlier re- 
search in the techniques of operant conditioning (Skinner, 1938, 1953). 
Skinner’s work provided impetus for a vast body of laboratory research, 
primarily with infra-human organisms (Journal of the Experimental Analysis 
of Behavior, 1957) which is the precursor of much of the present 
applied research in educational settings. 

‘ Laboratory research using operant principles with humans to study 
praia in academic behaviors also contributed to the present research 
B UNE A T (Staats, Finley, Minke, Wolf & Brooks, 1964; Staats & 
pee aoe ). hese and probably the most important influence on 
paot Ri ucation was the large number of clinical behavior 
n AE Ei lies c out in a variety of nonacademic settings, in 
dn Sata eas ler study, although not of an educational or aca- 
beh 1 re, were of importance because of the applications of basic 
e piel principles which were used (Ullmann & Krasner, 1965). 

The main purpose of this paper is to review and evaluate the published 
research germane to an applied behavioral analysis of various types of 
classroom settings. The behaviors most discussed in these settings are 
academic, i.e., study behavior, development of skills such as reading, spelling 
and arithmetic; however, nonacademic behaviors are also considered. 
Some nonacademic behaviors are important in establishing a climate con- 
ducive to more effective academic behaviors. These behaviors are classified 
broadly as classroom behavior problems which interfere with the ongoing 
educational process. 

Because the majority of educational settings are not special education 
classrooms, this review mainly addresses the research pertinent to normal 
classroom settings. However, the past and present research being conducted 
in the normal classroom is largely an outgrowth of a great deal of research 
carried out in special educational settings. Therefore, the review must also 
consider special classrooms in order to include the contributions of that 
research. 

It is necessary to state the criteria defining these two settings; the 
following definitions are used. A special classroom is any classroom to which 
students have been assigned on the basis of one of the following diagnostic 


lations. Many times the teacher has at least one aide to assist her. 
In general, these classrooms are labeled “special” because they contain students 
who have been singled out from the rest of the school population as being 
different enough to need special attention. The special classroom setting 
is usually a self-contained unit functioning under a different daily schedule 
than the normal classroom. 

A normal classroom is a setting in which there is usually one teacher 
who has not received special training for any specific population of children. 
The size of these classes usually ranges from 15 to 40 students who have not 
been diagnosed as having special problems. It is highly probable, however, 
that normal classrooms contain some students who exhibit the same behaviors 
characteristic of some of the categories relevant to the special classroom. 
Thus, the normal classroom, in general, may be defined as any school 
class in which the teacher or the students have not been selected on the 


f classroom that could fit either of the two types 


of classrooms mentioned above, depending on the composition of the 
this area there exists research 


(Allen, Hart, Buell, Harris & Wolf, 1964; Allen, Henke, Harris, Baer & 
Reynolds, 1967; Buell, Stoddard, Harris & Baer, 1968; Bushell & Jacobson, 
1968; Bushell, Wrobel & Michaelis, 1968; Harris, Wolf & Baer, 1964; Harris, 
Johnston, Kelley & Wolf, 1964; Hart, Allen, Buell, Harris & Wolf, 1964; 
Hart, Reynolds, Baer, Brawley & Harris, 1968; Hart & Risley, 1968; Wolf, 
Risley, Johnston, Harris & Allen, 1967). 

Because the behaviors dealt with in nursery school settings are not 
primarily academic behaviors, that research is not discussed in this review. 
Emphasis will be centered on the special and normal classrooms from 
kindergarten to high school. 


The Applied Criterion 


In evaluating research in the classroom setting, one of the criteria used 
to assess a particular study is applied. Whether a study is classified as applied 
depends, in part, on how important the problem is to society and on how 
important the behavior under study is to the subject of the research. To an 
extent, the classification of applied or nonapplied involves a value judgment 
as to how society or the subject or the applier views the behavior under 
investigation. It is possible to make this judgment fairly objective if one 
considers the function of the response in relation to immediate or long 
range goals of society or the subject. For example, decreasing the number 
of out-of-seat responses occurring in a classroom is probably not as closely 
related to society’s goals of education as is increasing a child’s accuracy in 
arithmetic. But the former behavior is important because teachers generally 
feel that children must be seated at least for a large proportion of the time 
in a classroom in order to engage in those activities that lead to the goals 
of education. 

There is a continuum of social importance of behaviors. This continuum 
varies depending on who is defining importance. Nevertheless, all of the 
behaviors discussed here are related to society’s educational goals even 


Pape eae paiia of congruity. Thus, their study is considered to 


Special Classrooms 


i ge of inappropriate classroom behaviors. 
Improved study or task orientated behavior has been reported for under- 
achievers (Walker & Buckley, 1968; McKenzie, Clark, Wolf, Kothera & 
Benson, 1968) and retardates (Zimmerman, Zimmerman & Russell, 1968). 


Token systems have been used extensively in many classroom settings 
to modify pupils’ behaviors. Token systems involve presentation of a unit 
of exchange (usually in the form of points, poker chips, etc.) contingent 
on some specific response. These tokens are later redeemed or exchanged for 
various back-up reinforcers. Back-up reinforcers generally used in educa- 
tional settings are often unconditioned reinforcers such as food or liquids. 
When these types of reinforcers are not appropriate, conditioned reinforcers 
in the form of toys, extra recess and other classroom privileges are exchanged 
for tokens. (Cf. Ayllon & Azrin, 1968, for a more comprehensive discussion 
of token economies.) 
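
The bookkeeping behind a token system can be pictured with a small sketch. It is purely illustrative: the class name `TokenLedger`, the pupil label, and the exchange price are hypothetical and are not drawn from any of the studies cited. Tokens are awarded only when the specified response occurs and are later exchanged for a back-up reinforcer at a fixed price.

```python
class TokenLedger:
    """Minimal record of tokens earned and exchanged in a classroom."""

    def __init__(self, prices):
        # prices: dict mapping a back-up reinforcer to its cost in tokens
        self.prices = dict(prices)
        self.balances = {}

    def award(self, pupil, response_occurred, tokens=1):
        """Give tokens only if the target response occurred (contingency)."""
        if response_occurred:
            self.balances[pupil] = self.balances.get(pupil, 0) + tokens
        return self.balances.get(pupil, 0)

    def redeem(self, pupil, reinforcer):
        """Exchange tokens for a back-up reinforcer; False if too few tokens."""
        cost = self.prices[reinforcer]
        if self.balances.get(pupil, 0) < cost:
            return False
        self.balances[pupil] -= cost
        return True


# Hypothetical use: three completed assignments buy one extra recess.
ledger = TokenLedger({"extra recess": 3})
for completed in (True, True, False, True):
    ledger.award("pupil A", completed)
exchanged = ledger.redeem("pupil A", "extra recess")
```

The essential feature is the contingency in `award`: no token is delivered unless the defined response has occurred, which is what distinguishes a token system from noncontingent rewards.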

In the remediation of academic problems, token systems have been 
used to improve grades (Tyler, 1967), achievement scores (Clark, Lacho- 
wicz & Wolf, 1968; Wolf, Giles & Hall, 1968) and correct answers on 
assigned tasks (Birnbrauer, Wolf, Kidder & Tague, 1965a, 1965b; Cohen, 
Filipczak & Bis, 1968; Tyler & Brown, 1968). 

Dyer (1968) increased completion of assignments by giving candy as a 
consequence of completed work. Using one of Premack’s (1965) principles, 
high probability behaviors such as game activities were made contingent on 
the occurrence of low probability behaviors; Nolen, Kunzelman and Haring 
(1967) reported increased academic output of low-achieving junior-high- 
school students. 

A number of studies were aimed at both a reduction of disruptive 
behavior and improvement of academic behaviors (Birnbrauer & Lawler, 
1964; Martin, M., Burkholder, Rosenthal, Tharp & Thorne, 1968; Zimmer- 
man & Zimmerman, 1962). 

Attempts have also been made to improve conduct and attention-to- 
task behavior of hyperactive children (Quay, Sprague, Werry & McQueen, 
1966, 1967; Knowles, Prutsman & Raduege, 1968) and autistic children 
(Hudson & DeMyer, 1968; Rabb & Hewett, 1967; Martin, G., England, 
Kaprowy, Kilgour & Pilek, 1968). 

One of the most unique behaviors studied in the special classroom 
situation was that of a pupil who was an elective mute (Straughn, Potter 
& Hamilton, 1965). In this study, the pupil could earn points for an all- 
class party contingent on his verbal behavior in the classroom. 

Most of the problems thus far discussed are also found in the normal 
classroom situation, and the studies cited in special classrooms have in 
large part provided heuristic stimuli for the research carried out in the 
normal classroom. 


Normal Classrooms 


The problems in a normal classroom are often quite similar to those in 
the special classroom. The main difference between the two settings is that 
the problems of the special class are usually more severe. 
One important type of behavior that interferes with education in a 
normal classroom is disruptive or inappropriate behavior. In a series of 
studies (Madsen, Becker, Thomas, Koser & Plager, 1968; Madsen, Becker & 
Thomas, 1968; Thomas, Becker & Armstrong, 1968; Becker, Madsen, 
& Thomas, 1967; O'Leary & Becker, 1967), behaviors such as out-of-seat 
responses of children, excessive noise, disturbance of others, talking out in 
class, etc. were reduced by using contingent teacher attention, rules paired 
with praise for following the rules, and tokens contingent on appropriate 
conduct. In a recent study by Barrish, Saunders and Wolf (1969), out-of- 
seat behavior and talking-out responses were reduced in a fourth-grade 
classroom. In this study, natural classroom consequences, such as extra 
recess, stars, and being first to line up for lunch, were contingent upon 
appropriate in-seat and nontalking behavior. 

Once the above behaviors are under control, the teacher must engage 
the children in effective study behavior. Research aimed at improving study 
behavior and at reducing behavior that interferes with study has been 
conducted (Patterson, 1965; Hall, Lund & Jackson, 1968; Hall, Panyan, 
Rabon & Broden, 1969; Broden & Hall, 1969). Significant gains have been 
achieved in increasing the time spent in study behavior. 

When students are not engaged in disruptive behavior and spend more 
time studying or working, it is important to determine whether this study 
behavior produces increased accuracy on academic tasks. Panyan (1968) 
and Evans and Oswald (1968) conducted research on improved accuracy 
of academic behavior. Unfortunately, the research on output or accuracy 
of academic behavior in the normal classroom is still meager. Disruptive 
problems appear to have high priority in classroom research up to the 
present time. 

In viewing the problems investigated in the two types of classroom, 
it appears that the behaviors investigated are socially important and are 
important for the subjects involved, and thus are correctly classified as 
applied problems. 


The Behavioral Criterion 


A second dimension which is used in the evaluation of an applied 
behavior analysis is its behavioral characteristic (Baer et al., 1968). This 
dimension involves measurement. According to Baer et al. (1968, p. 33): 


Explicit measurement of the reliability of human observers thus 
becomes not merely good technique, but a prime criterion of 
whether the study was appropriately behavioral. 


Satisfying the behavioral criterion involves two processes: (a) the objective 
definition and measurement of the behavior under investigation and (b) 
the reliability evaluations of the measurement techniques. 




Because the behaviors under study in classroom settings do not readily 
lend themselves to automated recording, two techniques of observation and 
measurement have been utilized. The first technique involves a human 
observer recording either frequency counts of the occurring behavior or 
duration of each response under study. The second technique is time sampling, 
in which the observer records the occurrence or nonoccurrence of a 
response only once during each of a series of predetermined intervals. This 
procedure is best exemplified by Madsen et al. (1968b, p. 141): 


Each observer had a clipboard, stopwatch, and rating sheet. The 
observer would watch for 10 sec. and use symbols to record the 
occurrence of behaviors. In each minute, ratings would be made 
in five consecutive 10-sec. intervals and the final 10 sec. would be 
used for recording comments. Each behavior category could be 
rated only once in a 10-sec. interval. 


There are variations on this time sampling procedure. The intervals may 
be increased or decreased in length. A response may be recorded on the 
basis of a predetermined time criterion rather than simply on the basis 
of occurrence or nonoccurrence. 
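
The interval recording procedure quoted above can be sketched as follows. The function name `interval_record`, its parameters, and the event times are hypothetical and serve only as an illustration. Within each minute, five consecutive 10-sec. intervals are rated, the final 10 sec. are left for comments, and a behavior is scored at most once per interval however often it occurs.

```python
def interval_record(event_times, session_seconds, interval=10, rated_per_minute=5):
    """Score occurrence or nonoccurrence of one behavior category by interval.

    event_times: times (in seconds from the start) at which the behavior occurred.
    Returns one True/False rating per rated interval, in order.
    """
    ratings = []
    for minute_start in range(0, session_seconds, 60):
        for i in range(rated_per_minute):
            start = minute_start + i * interval
            end = start + interval
            if end > session_seconds:
                break  # do not rate an interval that runs past the session
            # Any number of occurrences within the interval counts once.
            ratings.append(any(start <= t < end for t in event_times))
    return ratings


# Hypothetical two-minute observation of out-of-seat responses.
times = [3, 5, 27, 55, 62]
ratings = interval_record(times, session_seconds=120)
```

In this example the two responses at 3 and 5 sec. fall in the same interval and are scored once, and the response at 55 sec. falls in the comment period and is not scored, which shows why interval data can understate the frequency of high-rate behaviors.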

Both of the above observation methods yield different types of data. 


frequency of these nses, especially if they had a high rate of occur- 
enk red this case, AANE sampling oan? poni bes son dasa g> 
if an investigator wants to determine mi spen - 
behavior or how many times a particular child punches another child, then 
a duration or frequency measure would be appropriate. 

Achieving reliable measures is of the utmost importance. In assessing 
reliability one is asking, “Will the same measurement technique applied 
twice to the same phenomenon give the same result?” This question involves 
an accuracy criterion of measurement. One way of checking the measurement 
device is to have independent measurements taken concurrently by 
someone other than the primary observer. If two observers are measuring 
the same behavior and their agreements as to occurrence, nonoccurrence 
and frequency or duration are correlated closely, the investigator can say 
that the measurement technique is reliable at least for each point at which 
reliability measures have been taken. Another aspect of reliability involves 
a “homogeneity” criterion. In this case one is asking, “Is the measuring 
device measuring the same thing over time?” It is possible to get high 
reliability between observers over the course of a study and still not meet a 
homogeneity criterion. If the response definition of the two observers “drifts” 
or changes as the study progresses, high reliability could result; but the 
original response under observation might be quite different from the 
response being measured later in the investigation. To overcome this prob- 
lem, one could use a third observer on occasion as a reliability measurement 
of the first two observers. Additional independent observers provide valuable 
information on the consistency of the response definition used throughout 
a study; they also increase the probability that other investigators could 
use the same measurement technique. Without reliability or with low 
correlation between the observers’ data, it cannot be stated with any degree 
of accuracy what was being measured (if anything). The number of 
reliability measures taken to determine the reliability of the measurement 
technique seems to require a personal judgement on the part of each 
investigator. In general, most experimenters report reliability measures for 
each experimental condition with acceptable reliability usually meaning the 
range of 80 to 100% agreement between observers. Reliability between 
observers is usually determined by dividing the number of intervals in 
which both observers agree on the response occurrence by the total number 
of agreements plus disagreements. In some instances, an additional aspect 
of reliability may be indicated. If the behavior under study displays high, 
low and medium rates during the course of the investigation, then reliability 
measures should be taken at all of these levels to insure that the measure- 
ment technique is reliable regardless of the rate of the behavior. 
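
The agreement calculation just described can be written out as a short sketch. The function name `percent_agreement` and the sample records are hypothetical. The sketch also assumes one reading of the text: an agreement is an interval in which both observers scored the response, a disagreement is an interval in which only one did, and intervals in which neither scored the response are left out of the count.

```python
def percent_agreement(observer_a, observer_b):
    """Percentage agreement between two interval records of equal length.

    Each record is a list of True/False ratings, one per interval.
    Returns None when neither observer scored the response at all.
    """
    if len(observer_a) != len(observer_b):
        raise ValueError("records must cover the same intervals")
    pairs = list(zip(observer_a, observer_b))
    agreements = sum(1 for a, b in pairs if a and b)      # both scored occurrence
    disagreements = sum(1 for a, b in pairs if a != b)    # only one scored it
    scored = agreements + disagreements
    if scored == 0:
        return None
    return 100.0 * agreements / scored


# Hypothetical records for twelve intervals from two independent observers.
first = [True] * 8 + [True, False] + [False] * 2
second = [True] * 8 + [False, True] + [False] * 2
agreement = percent_agreement(first, second)  # 8 agreements, 2 disagreements
```

Here the two observers agree on occurrence in 8 intervals and disagree in 2, giving 80% agreement, the lower bound of the range usually reported as acceptable. Computing the figure separately for high-, medium-, and low-rate periods follows directly by passing the corresponding portions of each record.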
Sometimes reliability measures pose different problems. If the behavior 
under study is of such a nature that a permanent record of that behavior 
remains for analysis, the problem may be to assess the reliable reading of 
the permanent record. In studies involving scores on achievement tests, 
number of assignments completed, correct answers on arithmetic assign- 
ments, etc., reliability measures could be obtained by employing additional 
scorers to insure that the written response is reliably measured. A number 
of investigators have reported improvement on achievement or reading tests 
(Wolf et al., 1968; Clark et al., 1968; Cohen et al., 1968); school grades 
(Tyler, 1967); correct responses on assigned material (Nolen et al., 1967; 
Birnbrauer et al., 1965a, 1965b; Tyler & Brown, 1968); or number of assignments 
completed (Dyer, 1968). Although reliability measures were not 
reported in the above studies, it is likely that errors of recording were not a 

Special Classrooms 
Observer recording of behaviors which are indicative of studying by 
pupils is often called of task oriented behavior. This behavior 
requires fairly reliable measurement techniques. To evaluate the measurement 
device, reliability evaluations are often made by independent observers. 

Researchers on task oriented behavior (Walker & Buckley, 1968; Rabb 
& Hewett, 1967; McKenzie et al., 1968; Zimmerman et al., 1968) have re- 
ported reliability measures ranging from 75 to 100% agreement between 
independent observers. Reliability measures have also been taken on ap- 
propriate and inappropriate classroom behavior (Kuypers et al., 1968; 
Meichenbaum, Bowers & Ross, 1968; Martin, M. et al., 1968) with high reli- 
ability coefficients between observers. 

There are a number of studies in the special classroom setting that 
cannot be evaluated on the reliability of the measurement technique in- 
volved (Quay et al., 1966, 1967; Martin, G. et al., 1968; Sulzbacher & 
Houser, 1968; Zimmerman & Zimmerman, 1962; Knowles et al., 1968; Birn- 
brauer & Lawler, 1964; Hudson & DeMyer, 1968; Carlson et al., 1968) 
because reliability measures, if taken, were not reported. 

If reliability measures are not taken on a particular behavior, it does 
not necessarily follow that the recording technique was poor. But it is diffi- 
cult then to assess whose behavior was changed over time—the recorder’s 
or the subject’s. Without reliability measures taken by independent observ- 


change over time or over conditions or for different levels of the behaviors 
under study. Reliability measures indicate the likelihood of replication of 


Normal Classrooms 


In comparing normal classrooms to the special setting, one is struck by 
the greater number of students and the amount of activity. The 
normal classroom has more students and seems to exert less control over 
each student. Special classrooms many times have special observation rooms 
and recording facilities, and there are usually more teachers and aides in 
them than in the normal classroom. Thus, research involving recording of 
complex responses is easier to do in a special classroom. 

It is interesting to note that of all the normal classroom studies in which 
reliability measures would be indicated because of the behaviors being recorded, 
only one (Patterson, Jones, Whittier & Wright, 1965) failed to 
report reliability. The other researchers in normal classrooms not only 
reported reliability, but showed some improvement of the measurement 
techniques over time. In several studies, observers were trained in recording 
techniques prior to recording baseline data; the pre-training was an attempt 
to increase the probability of high observer reliability at later stages of the 
study (Becker et al., 1967; Walker & Buckley, 1968). This pre-training of 
observers helps familiarize observers with various recording tools prior to 
the actual study and provides them with an opportunity to define and solve 
any unanticipated problems that might be peculiar to the measurement 
technique. 

Reliability measures were reported at least once for all conditions of a 
study (Hall et al., 1969; Broden & Hall, 1969; Barrish et al., 1969) rather 
than the usual statement that reliability checks were made occasionally. 
This type of reporting indicates that the observational measurement device 
was checked for reliability over varying conditions. In an attempt to elim- 
inate observer bias, Madsen et al. (1968b) instigated experimental changes 
without informing the observer. 

In reviewing the research in the normal and special settings using the 
“behavioral” criterion of Baer et al. (1968), one might infer that because 
measures of reliability are more common in the normal classroom literature, 
research in the special classroom is not as rigorous as it should be. This 
need not be. The majority of special classroom research was conducted prior 
to most of the normal classroom research, and it is only natural that the 
later research should have built upon and refined the techniques of earlier 
endeavors. 


In referring back to the hypothetical teacher, it is obvious that she does 
not often meet the behavioral criterion. However, in public school settings, 
teachers can meet this criterion. The teacher's measurement can be reliably 
checked by others, such as teacher aides, guidance counselors and 
teacher trainees. If the teacher does meet the behavioral criterion she will be 
in a better position to assess her methods of promoting various classroom 
behaviors and will be able to refine her measurement technique. 


The Analytic Criterion 


In the discussion of the behavioral criterion, emphasis was placed on 
reliability of the measurement device, namely the human observer. When 
evaluating research by the criterion of analytic, one is assessing 
the processes or procedures used in effecting changes in behavior. 

It is not enough to demonstrate that a procedure applied to 
a specific type of behavior was followed by an increase or decrease in the 
behavior. It is always possible that the introduction of the procedure may have 
coincided with another change in the organism’s environment which actually 
produced the observed changes in behavior. To provide evidence that 
the procedure used was the effective one, two basic techniques have 
evolved to test the controlling aspect of the procedure used. 

The first technique is commonly called a reversal or ABA design. In 
this technique, a behavior is measured over time (A) until it appears that 
the behavior is relatively stable. When the behavior is not changing to any 
great extent, an experimental variable is applied (B) to the behavior in 
question to observe what effect it will have on that behavior. If the experi- 
mental variable appears to have a noticeable effect on the behavior, then 
the procedures are changed back to condition (A) again with the experi- 
mental variable removed. The rationale for the ABA design involves the 
concept of baseline logic. The first (A) condition or baseline is used as a 
basis of prediction of what the behavior would be at a future time if the 
experimental condition (B) had not been imposed on the behavior. The 
behavior measured under the experimental condition (B) when compared 
to the first baseline (A) indicates whether a change in the behavior has 
occurred. The second baseline condition (A) or reversal condition is used 
to test the accuracy of the original prediction of what the level of behavior 
would have been at a future time in the absence of the experimental vari- 
able. If the level of behavior in the reversal condition is similar to that of 
the baseline condition, the original prediction is verified. In using an ABA 
design, the experimenter is attempting to establish that the experimental 
procedures effected the change in the behavior under study. 
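
The baseline logic of the ABA design can be illustrated with a short sketch. It is only a rough illustration under stated assumptions: phase means stand in for the "level" of behavior, and a fixed tolerance (a hypothetical parameter, not a value given by any of the authors cited) decides whether two levels count as similar. The function name and the session data are invented for the example.

```python
from statistics import mean


def reversal_summary(baseline, treatment, reversal, tolerance):
    """Compare phase means for an ABA design.

    baseline, treatment, reversal: session-by-session measures for the
    first (A), the (B) and the second (A) condition.
    tolerance: assumed largest difference still treated as "similar".
    """
    a1, b, a2 = mean(baseline), mean(treatment), mean(reversal)
    # Did the behavior depart from the level predicted by the first baseline?
    changed = abs(b - a1) > tolerance
    # Did the behavior return to that level when the variable was removed?
    recovered = abs(a2 - a1) <= tolerance
    return {
        "baseline": a1,
        "treatment": b,
        "reversal": a2,
        "changed_under_b": changed,
        "prediction_verified": changed and recovered,
    }


# Hypothetical percentages of intervals scored as study behavior.
summary = reversal_summary(
    baseline=[20, 22, 21, 19],
    treatment=[60, 65, 70, 68],
    reversal=[25, 22, 24, 21],
    tolerance=5,
)
```

In this invented case the level rises under (B) and returns to within the tolerance of the first baseline during the reversal, so the original prediction would be counted as verified. A comparison of means ignores trends within a phase, which visual inspection of the session-by-session data would still have to cover.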


There are many variations of an ABA design. The experimental vari- 
able might be reinstated after the second baseline (A) condition resulting 
in an ABAB design, or additional variables may be added to the original 
experimental variable leading to an ABACA design. In the ABA design, 
condition (B) could be the removal of a variable involved in (A) rather 
than the addition of a new one. In a study by Birnbrauer et al. (1965b), 
tokens were given contingent on correct academic responses during (A). 
They were removed (extinction) during (B) and reinstated during the 
second (A) condition; results during the (B) condition indicated a decrease 
in accuracy on the academic task. Because applied research in school 
settings usually involves remedial or therapeutic aspects, any procedures 
which bring about improvement in a student’s behavior are usually reinstated 
prior to termination of research efforts. 
It is obvious that the use of the reversal technique presents problems in 
a school setting. When behaviors of a desirable educational nature are 
produced, teachers, principals and researchers are reluctant to return to 
past conditions (when the behavior was either absent or disruptive). A 
second problem is the failure of the behavior to return to previous levels 
when the experimental variable is changed. It appears that there may be 
times when the changed behavior may come into contact with natural rein- 
forcers in a student's environment, and the changes in behavior may be 
maintained by these reinforcers. For example, reading skills developed 
through reinforcement may later be maintained at a high level by the con- 
tent of what is read, even when the student is no longer given the reinforcer 
used to bring about this improvement. 

When situational constraints or problems of irreversibility militate 
against a reversal technique, a relatively new technique may be used which 
allows a demonstration of control without the problems peculiar to a re- 
versal. Risley and Baer (1969, pp. 5-6) called this a “multiple baseline” 
technique. 


Any change in the level of this behavior is compared with the level 
predicted for that behavior from the baseline measures. The accur- 
acy of this prediction is assessed by comparing this prediction with 
the continuing measures of the other behavior(s). If, in fact, the 
level of the other behavior(s) remains relatively constant, and to 
the extent that it can be assumed that uncontrolled variables, if they 
had occurred, would have similarly effected all of the behaviors 
measured, the baseline prediction of the first behavior is supported. 
This is a somewhat weaker design than the A-B-A design, since it 
involves an additional assumption: that all the measured behaviors 
are susceptible to the same variables. This latter assumption is, 
however, supported by demonstrating that the other behaviors are 
also susceptible to the same experimental procedures as the first, by 
applying those procedures to the second behavior, and so on. 
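
The logic of the multiple baseline technique can be sketched as follows. This is an illustrative sketch only: it assumes two or more behaviors (or settings) measured over the same sessions, with the procedure introduced at a different session for each, and it uses phase means and a hypothetical tolerance to judge whether an untreated baseline "remains relatively constant." All names and data are invented.

```python
from statistics import mean


def multiple_baseline_check(series, starts, tolerance):
    """Summarize a staggered (multiple baseline) introduction of a procedure.

    series: dict mapping a behavior name to its session-by-session measures.
    starts: dict mapping the same names to the index of the first treated session.
    tolerance: assumed largest drift still treated as a constant baseline.
    """
    order = sorted(series, key=lambda name: starts[name])
    first_start = starts[order[0]]
    report = {}
    for position, name in enumerate(order):
        data, start = series[name], starts[name]
        # Change in level when the procedure reaches this behavior.
        report[name] = {"shift": mean(data[start:]) - mean(data[:start])}
        if position > 0:
            # While the first behavior was already treated, did this
            # still-untreated baseline stay near its earlier level?
            before = data[:first_start]
            during = data[first_start:start]
            drift = abs(mean(during) - mean(before))
            report[name]["baseline_held"] = drift <= tolerance
    return report


# Hypothetical percentages of appropriate behavior in two class periods.
report = multiple_baseline_check(
    series={
        "afternoon": [30, 32, 31, 29, 70, 72, 75, 74, 76, 78, 77, 79],
        "morning": [28, 30, 29, 31, 30, 29, 31, 30, 68, 72, 74, 73],
    },
    starts={"afternoon": 4, "morning": 8},
    tolerance=5,
)
```

In the invented data the morning baseline holds steady while the afternoon behavior changes, and the morning behavior changes only when the procedure is applied to it in turn, which is the pattern the quoted passage treats as support for the baseline prediction.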


Since behavioral analysis is relatively new in the classroom setting, 
some attempt must be made to demonstrate empirically the reliability of the 
techniques involved, to enlarge the findings by “systematic replication” 
(Sidman, 1960) and to demonstrate the relevance of behavioral analysis to 
the setting. 


Special Classrooms 


The use of a specific control technique does not automatically guarantee 


at when tokens were discontinued for ac- 
ects showed little change in accuracy, and 
10 subjects showed increased errors. Zimmerman et al. (1968) reported 
similar individual differences among students subjected to reversal tech- 
niques. Straughn et al. (1965) found that the verbal behavior of an elective 
mute actually increased after the reinforcement procedures were discon- 
tinued. When irreversibility is encountered with some subjects, as it was 
in the studies above, it becomes difficult to evaluate the results obtained. It 
is possible that the change in the behavior was a coincidence rather than a 
function of the experimental variable. A second possibility is that the 
changed behavior was maintained by other natural environmental variables 
after the experimental variable was removed. Both of these alternative 
explanations unfortunately must remain at the level of speculation. 

Lack of an adequate demonstration of the controlling function of a 
variable may also be caused by prematurely imposing an experimental 
variable upon the behavior while the behavior is still changing (Kuypers 
et al., 1968). For example, if disruptive behavior seems to be decreasing 
and an experimental variable is introduced to eliminate it, the experimenter 
will have difficulty determining whether a continued decrease is caused by 
the experimental change or whether the decrease would have continued without 
applying any experimental change. 
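
One way to guard against introducing a variable while the baseline is still drifting is to estimate the trend of the baseline sessions first. The sketch below is an assumption-laden illustration, not a rule drawn from the studies reviewed: it fits an ordinary least-squares slope to the baseline measures and compares it with a hypothetical maximum slope chosen by the investigator.

```python
def baseline_slope(values):
    """Least-squares slope of the measures across consecutive sessions."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two sessions are needed to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def is_stable(values, max_slope):
    """True when the per-session trend is no steeper than max_slope (assumed)."""
    return abs(baseline_slope(values)) <= max_slope


# Hypothetical counts of disruptive responses per session.
decreasing = [40, 36, 31, 27, 22]
flat = [30, 31, 29, 30, 30]
ready_decreasing = is_stable(decreasing, max_slope=1.0)
ready_flat = is_stable(flat, max_slope=1.0)
```

Here the first invented baseline is falling by several responses per session and would not be treated as stable, while the second would. The choice of a maximum slope remains a judgment for each investigator.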

A reversal technique that does not involve extinction procedures can 
be used to evaluate the function of a particular reinforcer. Instead of remov- 
ing the experimental variable in an ABA design, Nolen et al. (1967) 
changed from contingent to noncontingent tokens in the (B) phase of their 
study designed to increase correct academic responses. The results indicated 
a decrease in accuracy when tokens were noncontingent and an increase 
in correct responses when tokens were contingent. 


Walker and Buckley (1968) presented an excellent example of control 
of study behavior in an ABAB design in a laboratory setting. In their study, 
task orientation was increased significantly when study behavior was rein- 
forced with points redeemable for a toy. A reversal procedure reduced study 
behavior and thus demonstrated the function of the reinforcers. When the 
reinforcement was reinstated in a normal classroom setting, the study rate 
rose again to the level achieved in the laboratory setting. 


There are a number of studies in which behavioral changes of a perma- 
nent nature do not readily lend themselves to reversal techniques. In these 
cases, indirect analysis of the experimental variables must be made. Clark 
et al. (1968), while attempting to improve achievement levels of school 
drop-outs, demonstrated the effects of tokens within an ABA design by vary- 
ing the amount of tokens earned for completing reading assignments. Re- 
sults of this study indicated that the number of assignments completed was 
positively related to the number of tokens that could be earned for each 
assignment. In a similar study, Wolf et al. (1968) used Clark’s technique to 
show that student preferences for various subject matters (English, reading, 
arithmetic) could be changed by varying the number of points that could 
be earned in each subject area. These two studies somewhat obviate the 
often heard excuse that control measures were not used either because the 
behavior could not be reversed (achievement scores) or because the behavior 
that was developed was too important to reverse. 

Tyler and Brown (1968) reported mean improvement in correct respon- 
ses on tests based on daily newscasts when tokens were contingent on correct 
answers, and a decrement in correct responses when tokens were presented 
on a noncontingent basis. The authors used two groups and reversed pro- 
cedures with each group. Instead of a typical ABA design, Tyler and Brown 
used one group as an estimate or baseline of what the other group would 
have done if the experimental variable had not been introduced. 


The use of the multiple baseline technique is relatively new to applied 
behavior analysis. Using a time sampling measurement technique, Meichen- 
baum et al. (1968) reported a successful application of a multiple baseline 
technique. In their study, two different groups of institutionalized girls 
were observed in both morning and afternoon classes. Baselines were estab- 
lished for each group in both their morning and afternoon classes for appro- 
priate classroom behavior (task oriented behavior). When points were given 
in the afternoon classes for appropriate behavior, an increase in this behavior 
resulted, while the task oriented behavior in the morning classes remained 
the same. After a number of sessions, points were also made contingent on 
appropriate behavior in the morning sessions; the change resulted in an 
increase in task orientation during the morning period. In the last phase 
of the study, the points they earned were reported to the students during 
the observation period. More immediate feedback of earned points resulted 
in maintaining a high rate of appropriate classroom behavior. 


Bachrach (1963, p. 1) stated, “People don’t usually do research the 
way people who write books about research say that people do research.” 
His comment seems to apply to some studies in special classroom settings. 
There are studies in which data either have not been presented or have been 
presented at the anecdotal level; it is difficult to evaluate any technique so 
reported. Zimmerman and Zimmerman (1962) found decreases in undesir- 
able classroom behavior of two children subjected to extinction procedures 
for inappropriate behavior and social reinforcement for appropriate be- 
havior. Data were not presented on the occurrence of the undesirable 
behaviors before or during experimental conditions; thus, evaluation of the 
procedures used is not possible. Knowles et al. (1968) described the reduc- 
tion of hyperkinetic behavior and the improvement of writing skills of a 
seven-year-old child that was achieved by giving him candy contingent on 
appropriate behavior. Since no data were presented on the frequency of 
responses under study prior to or during the experimental manipulation, 
evaluation of the procedures used cannot be performed. In a study in which 
tokens were used to increase task orientation of autistic children, Rabb and 
Hewett (1967) reported that severely disturbed children can function as a 
group in a classroom; unfortunately, the authors did not report specifically 
on the effects of tokens on task oriented behavior with these children. Hudson 
and DeMyer (1968) reported that autistic children would use food as a 
medium for finger painting and other arts and crafts activities. Since the 
data from this study are presented in anecdotal form, it is difficult to determine 
whether the children used the food medium because it was reinforcing 
or because the therapist no longer attended to them when they ate the 
medium. The improvements in behaviors cited in the above studies are 
probably quite important in the settings in which they occurred; but since the 
authors did not clearly specify what changes did occur and what variables 
were responsible for the changes, one cannot adequately evaluate the pro- 
cedures used. 

The authors of another group of studies presented data which indicate 
that important behaviors were changed, but they did not attempt any sys- 
tematic manipulations of the experimental variables (McKenzie et al., 1968; 
Martin, G. et al., 1968; Quay et al., 1966; Cohen et al., 1968; Dyer, 1968; 
Martin, M. et al., 1968; Tyler, 1967; Carlson et al., 1968). These studies 
were AB designs in which reversal techniques were not carried out. Al- 
though changes in the behaviors under study were reported, the causes of 
the changes were not empirically tested. 


Normal Classrooms 


Later research often is able to improve and capitalize on previous research. It is 
also true that as the approach of applied behavioral analysis matures, 
the requirements for a reliable demonstration of control techniques become 
more rigorous. To state it simply, early research must expend 
all of its efforts in order to demonstrate that a phenomenon exists, 
and subsequent research must examine more closely the factors that produce 
the phenomenon. It could be that the audience that must allow the experimental 
research in the normal classroom setting is more skeptical 
than that of the special setting. Since there have been many innovations 
in education that have not lived up to prior expectations, the demand 
for rigorous proof may be a function of a past history of nonreinforcement 
for this audience. 

In the normal classroom, a token system (O’Leary & Becker, 1967) or 
teacher attention and praise (Becker et al., 1967) have sometimes been used 
to reinforce reductions of disruptive classroom behavior. The authors of 
these two studies reported a decrease in disruptive behavior when award 
tokens or attention was made contingent on nondisruptive behavior (study- 
ing, sitting quietly, etc.). Since the conditions of the setting prevented the 
authors from testing the function of the reinforcers by reversal techniques 
and from employing the multiple baseline approach, it is not possible to 
evaluate the role of these reinforcers clearly. 

In a later study (Thomas et al., 1968) reversal procedures were carried 
out in a number of phases of the study to evaluate the effect of praise for 
appropriate behavior versus disapproval for inappropriate behavior. Teacher 
approval (praise, physical contact, etc.) contingent upon appropriate or 
nondisruptive behavior decreased the incidence of disruptive behavior, while 
no approval resulted in an increase in disruptive behavior. Frequent disap- 
proval (scolding, spanking, etc.) of a disruptive behavior increased the prob- 
ability of that disruptive behavior. Accompanying the decrease in inappro- 
priate behavior were increases in the percentage of time students spent on 
appropriate behaviors, such as attending to the teacher or the task. A recent 
study by Barrish et al. (1969) involved a multiple baseline design and re- 
versal procedures. Out-of-seat and talking-out behaviors were reduced in 
a fourth-grade class by using individual contingencies (check marks for 
inappropriate behavior) and a group consequence (possible loss of privileges 
for the individual’s team as a result of the number of check marks). In this 
study, successful reversal and reinstatement procedures were carried out for 
each of the two responses during a math period; the multiple baseline design 
for the two behaviors was carried out during a math and a reading period. 
The results indicated that using a reversal technique in the normal setting 
does not seem to permanently impair the students’ behavior when reinforce- 
ment is reinstated. It appears from these results that reversal procedures 
could be used in research on many other behaviors in other classroom set- 
tings; such research could provide a measure of reliability of the variables 
under study, without harming the improved behavior. 


Very few studies on accuracy or rate of academic output in normal 
classrooms have been reported. Panyan (1968) used teacher praise and 
attention for correct answers to increase the rate and accuracy of arithmetic 
skills of eight fourth-grade students. Using an ABA design, she found that 
all but one subject increased their rates of problems-per-minute during the 
attention phase (B) of the study. She also found that the subjects’ accuracy 
did not decline during this period. Unfortunately, the controlling aspect 
of teacher attention is unclear; when a reversal was instituted, many of the 
students continued to improve in response rate, and all subjects showed 
little or no change in accuracy after the removal of teacher attention. 

In a series of studies using peer reinforcement techniques, Evans and 
Oswalt (1968) found that five of six low-achieving students improved on 
weekly tests if each day during the week they could earn free recess time 
for themselves and the rest of the class by answering questions correctly on 
academic subjects. Reversal conditions were instituted only for two students 
in these studies, and the results for each seem to be in opposition to the 
other; but in comparing the improvement of the experimental subjects with 
control subjects, it appears that peer reinforcement was an effective tech- 
nique in bringing about improved test behavior. In research in the normal 
classroom by Hall and his associates (Hall et al., 1968, 1969; Broden & 
Hall, 1969), the effects of the teacher's behavior on students’ study rates 
were investigated. Reversal procedures in these studies reflect clearly the 
reinforcing effects of teacher attention (Hall et al., 1968) and tokens 
(Broden & Hall, 1969) on study rates. The findings of these three studies 
indicate that when teacher attention was programmed contingent on study 
behavior, there was an increase in the study behavior of the students. The 
data also indicate that removal of this attention resulted in a clear decre- 
ment of study behavior. These studies are most significant not only for 
the contribution they make to the analysis of behavior in the classroom 


condition in this study began with the rules in effect; thus, no evaluation 
of rules as such was made. In their second study, Madsen et al. (1968b) 
evaluated the effect of rules versus no rules. They found that rules alone 
decreased disruptive behavior very little, and that rules plus ignoring the 
behavior actually increased the disruptive behavior. However, the combination 
of rules and ignoring plus praise for appropriate behavior decreased 
the disruptive behavior. Reversals were carried out in this study to confirm 
the reliability of the method. A more general comparison between the two 
studies is not appropriate: the data of the first represented 45 children, but 
the data of the second represented only 2 children. 


An unusual procedure was used to present immediate reinforcement 
for increased task activity of a hyperactive child (Patterson, 1965). In 
this study, a light flash and counter click were contingent on task oriented 
behavior. Each flash of the light indicated that the student and his peers 
had earned either one penny or one candy. The candies or pennies earned 
during each session were later divided among the class. Study rate increased, 
but in the absence of control conditions the increase is difficult to evaluate. 
In a follow-up study utilizing two subjects (one as control), Patterson et al. 
(1965) did use reversal procedures, but the improved attending behavior 
did not revert to baseline conditions. One of the problems cited in these 
two studies was that social reinforcement by peers was neither controlled 
nor recorded; therefore, it is difficult to separate the effects of this possible 
social reinforcement of peers from the reinforcement of pennies and candy. 


There are additional analyses that can be made once the reliability of 
behavior change has been demonstrated adequately. The first of these is 
component analysis, in which the various aspects of all of the procedures 
that make up a condition are analyzed. Close examination of the procedures 
used in the various settings shows that many of the stimuli called reinforcers 
are quite complex. In almost all of the studies of token reinforcement, social 
reinforcement in the form of praise or teacher attention was paired with 
the presentation of the tokens. The concept of social reinforcement or teacher 
attention also contains many components (touching, praising, smiling, etc.) 
that might be analyzed to determine their separate effects. Up to this time, 
no complete analyses have been carried out on the various component parts 
of the procedures used in classroom settings. At the present time, many 
procedures in the classroom are of a “shotgun” approach, in that many 
variables are involved in the total stimulus complex. In order to more 
clearly specify the role of each aspect of this complex, component analyses 
must be carried out. Only a component analysis can lead to a clearer speci- 
fication as to what are the necessary and sufficient conditions needed to 
bring about changes in various classroom behaviors. 


Another type of analysis concerns the amounts or values of the proce- 
dures used. This is usually called a parametric analysis. Parametric analyses 
were performed in some studies involving the use of token reinforcement 
(Wolf et al., 1968; Clark et al., 1968). It is readily apparent that the 
values of tokens are more easily manipulated than the value of a social 
reinforcer such as praise; parametric analyses must await more precise 
quantification of the various reinforcers that have been used in classroom 
settings. 


In observing the forms of analytic progression in other areas such as 
the early research of Skinner and his associates, it appears that the historical 
pattern starts with a demonstration that a phenomenon can be measured 
(Phase 1) and reliability demonstrated (Phase 2); the later phases of re- 
search are analyses of parameters and components of the process involved 
in behavioral change. At the present time, the emphasis of applied behavior 
analysis in the classroom setting seems to be on the first two phases, with 
indications that a more sophisticated analysis of components and parameters 
of reinforcement will be made in subsequent research. Although component 
and parametric analyses seem to be done later in the progression of re- 
search, this does not mean that these two aspects are of lesser importance 
than measurement and reliability. Only when all of these aspects of research 
are carried out can there be a complete applied analysis of behavior. Com- 
ponent and parametric analyses must be conducted in research on class- 
room behaviors if researchers are to be in a position to specify with precision 
not only the variables, but also the parameters of these variables, which are 
functional in behavioral change. 

From this discussion of an analytical approach, it can be seen that the 
teacher in our early example only slightly approximates the requirements 
of the analytic criterion. This does not mean that it is impossible for the 
teacher to carry out analytic techniques. The teacher can utilize a multiple 
baseline or reversal technique as these examples show. Only by being 
analytic can the teacher evaluate her methods and thus hope to refine them. 


The Technological Criterion 


In evaluating research by a technological criterion, one is assessing the 
completeness and clarity of the description of the procedures used. According 
to Baer et al. (1968, p. 95): 


The best rule of thumb for evaluating a procedure as technological 
is probably to ask whether a typically trained reader could replicate 
that procedure well enough to produce the same results, given only 
a reading of the description. 


This general rule will be used in evaluating classroom research, but it 
should be clear that there are usually divergent opinions of the technological 
adequacy of description depending on how well the evaluator fits the mold 
of a “typically trained reader.” 

Whether research is done in a special classroom or a normal classroom, 
complete identification of procedures can and should be made. Clear and 
complete identification of techniques is necessary regardless of the settings 
or situational constraints in which an applied behavior analysis is carried 
out. Therefore, the research discussed using the criterion of technological 
will not be dichotomized into settings. 

Much classroom analysis appears to be technologically adequate to 
insure replication. In a series of studies on disruptive behaviors, Madsen et 
al. (1968a, 1968b), Carlson et al. (1968), Kuypers et al. (1968), O'Leary 
and Becker (1967), Becker et al. (1967), and Thomas et al. (1968) used 
very similar techniques and have demonstrated direct and systematic replication 
of the processes involved. 

Replication of various procedures applied to study behavior was reported 
by Hall et al. (1968, 1969), Broden and Hall (1969) and Walker 
and Buckley (1968). Token reinforcement procedures used to improve 
academic behavior were replicated by Birnbrauer et al. (1965a, 1965b), 
Clark et al. (1968) and Wolf et al. (1968). 


Techniques of social reinforcement and a unique feedback device for 
presenting points contingent upon appropriate classroom behavior were 
replicated in a series of studies (Patterson, 1965; Patterson et al., 1965; 
Quay et al., 1966, 1967). 


Replication does not necessarily establish replicability; it is possible 
that the replications in some of the above studies could have developed 
independently without the investigators reading each other's descriptions.
Whether this happened in the above studies or not, they did contain clear 
technological descriptions which indicate that a typically trained reader 
could probably repeat these procedures and achieve similar results. 


There are several studies which have not been replicated, but which 
appear to have complete and clear descriptions of procedures; thus, they 
probably satisfy the criterion of being technological (Cohen et al., 1968; 
Tyler & Brown, 1968; Evans & Oswalt, 1968; Panyan, 1968; McKenzie et al., 
1968; Martin, G. et al., 1968; Zimmerman, 1968; Sulzbacher & Houser, 1968; 
Martin, M. et al., 1968; Tyler, 1967; Barrish et al., 1969). 


There are some classroom investigations that might prove difficult to 
replicate because the procedures are not described clearly. Dyer (1968) 
described a failure to achieve an increase in completed school assignments 
by the use of tokens contingent upon completed work, but he never stated 
what back-up reinforcers, if any, were used in exchange for the tokens 
earned. M. Martin et al. (1968) described a token system and a special
phase system used to reintegrate adolescent deviates into the normal
school setting. Neither specific contingency measures between response
requirements and tokens nor the criteria for promotion from one “phase”
to another were completely described.


In a study by Nolen et al. (1967), points backed up by access to high
probability activities were made contingent upon successful completion of
academic tasks. However, the criterion of success was never stated. The
investigators reported that they gradually increased the length and difficulty
of the stimulus materials for each student on an individual basis; however,
it is not clear from the description how the progression from one stage
to the next was evaluated and specifically what materials were used in the
composition of the academic tasks.

Straughn et al. (1965) found that social reinforcement procedures 
effectively increased classroom talking of an elective mute. In Straughn’s 
study, the definition of the response was “clear, meaningful vocalization.” 
It is not clear from this definition exactly what constituted one complete
response unit. 

With some studies, although the description of the techniques seems 
clear enough for replication, the lack of data on the results makes it im- 
possible to evaluate congruence of the results of a replication (Hudson & 
DeMyer, 1968; Birnbrauer & Lawler, 1964; Rabb & Hewett, 1967; Zimmer- 
man & Zimmerman, 1962; Knowles et al., 1968). 


It appears that clear and precise description is found most often in 
studies in which the demonstrated effects are clearly visible and most 
effective in bringing about behavioral improvement (Hall et al., 1968, 1969; 
Broden & Hall, 1969; Thomas et al., 1968a; Wolf et al., 1968). It is probably 
not accidental that clear and precise descriptions seem to appear with 
demonstrable behavior changes. Clear and precise descriptions indicate 
well-planned strategies and tactics of research which would increase the 
likelihood of successful application of the procedures being investigated. 


One might feel that, because the teacher in the classroom usually does
not write out her procedures in the same manner as a behavioral
researcher, the technological criterion is probably not appropriate for
her. This is not necessarily true. Teachers do write their lesson plans or
procedures for daily work, and in many cases these procedures may have 
to be replicated by a substitute teacher. If these plans (procedures) are not 
complete and clear, the teacher might be confronted with many problems 
during a class day because of incomplete procedural plans. The problems 
arising from poor technological descriptions become compounded when a 
substitute teacher must take over. Teachers not only can profit from 
techniques of classroom management using an applied behavior analysis 
approach, but they can also become more proficient in their own techniques 
of classroom management by becoming precise in their technological descrip- 
tion of everyday procedures. The “good” teacher is usually described as one 
who plans ahead and is well organized. This seems to be another way of 
saying that the good teacher has a clear and concise technological description 
of what her daily procedures will be in the classroom. 


The Conceptual Criterion 


The studies of applied behavior analysis in special and normal class- 
room settings consistently relate technological descriptions to basic concepts 
of the science of human behavior. Teacher attention has usually been 
referred to as social reinforcement. Techniques for progressive changes in 
the topography of the classroom response under study have been described
as reinforcement, punishment, extinction and shaping procedures. Tokens,
points and other exchangeable stimuli have been classified as conditioned
reinforcers. Various temporal or frequency aspects of the presentation of
reinforcement have been described according to well-known schedules of
reinforcement (Ferster & Skinner, 1957). With relatively minor exceptions,
the field of applied classroom research appears to satisfy the criterion
of being conceptually systematic.


Obviously, the majority of teachers do not relate their technological 
descriptions to basic behavioral concepts because the majority of teachers 
are not well versed in behavioral concepts. If teachers were to relate some 
of their techniques to these basic concepts, much might be gained by other 
teachers. In every school there are “good” teachers. The explanation often 
given is that they are good teachers because they have learned the “tricks 
of the trade” from years of experience. But “tricks” need to be memorized, 
and the more there are, the harder it is to teach and learn them. Principles 
are fewer, easily taught and learned, and will produce the “tricks” if 
logically used. Principles generate more “tricks,” but “tricks” alone do not. 
If the good teachers’ techniques could be stated in a systematic way as 
behavioral principles (rather than “tricks”), they could provide an existing 
system of knowledge for other teachers to draw from. 


The Effective Criterion 


A sixth criterion for an applied behavior analysis deals with the concept 
of utility. Utility answers the question: Were the effects large enough to be 
useful to the subject and those who must work with the subject? In review- 
ing the literature of applied behavioral analyses in the classroom, I found 
obvious problems in judging the effectiveness of the behavioral change 
because the relevance of the behavior to the setting can only be inferred. 


A study in which autistic children’s responses were shaped to finger 
paint with ice cream sauce (Hudson & DeMyer, 1968) may seem to the 
casual reader to lack practical value, but a study that demonstrates the 
improvement of academic ability of low-achieving poverty area children 
(Wolf et al., 1968) might seem to have practical value. Both of these 
judgments must be made with consideration of the milieu in which the 
behavior occurs. In a class of autistic children, the tremendous deficit in
constructive or functional behaviors of these children quickly becomes
evident to the observer. The staff who deal with autistic children would
consider any small improvement to be highly relevant. 


Anecdotal information has been given in studies presenting data
on demonstrated changes in behavior, indicating that school staff (teachers,
principals, aides) felt that the changes made were important in their setting.


Accordingly, the effectiveness of an applied behavioral change should be 
judged by those who are in the position to be most qualified. 
The teacher in our hypothetical classroom is qualified to judge
whether the behavioral changes she makes are effective; but to make these
judgments she must be in a position to observe how large the changes were.
To do this effectively, the teacher must have some empirical indication that
a change has occurred and that it was a function of her procedures. Lack
of these data reemphasizes the need for the teacher to take behavioral
measures reliably and to analyze her procedures in order to judge their
effectiveness.


The Generality Criterion 


The criterion of generality is one of the most important and most often 
neglected aspects of applied behavior research in the classroom setting. In 
assessing the concept of generality, one is asking if the behavioral change is 
relatively permanent over time and if the change not only occurs under 
the specific conditions of training but also spreads to other settings and 
other behaviors. For example, if the study rate of a pupil working on arith- 
metic is improved, will this improvement last and will he also study more 
of other academic materials? 

A few attempts were made to evaluate the permanency of change by 
the use of postchecks of task-oriented behavior of the student under study; 
the results indicate that the improvements in the behavior were still being 
maintained months after the studies were terminated (Hall et al., 1968, 
1969; Broden & Hall, 1969). 

McKenzie et al. (1968) extended reinforcement (money) for grades to 
students returning to their regular classroom from a special one and found 
that the students were maintaining high academic standards in the normal 
setting. Students from the special classroom returning to their regular class- 
room without reinforcement for grades also maintained high academic 
standards, which demonstrated generalization of the improved behavior. 
Walker and Buckley (1968) programmed improved task orientation from a 
laboratory setting to a regular classroom with a low achiever by instructing 
the teacher to present social reinforcement contingent on task-oriented 
behavior.

Generality was also demonstrated in a reduction of disruptive 
behaviors during observation periods prior to conditioning sessions by simply 
having an observer present in the classroom (Patterson, 1965). Many other 
studies included references to generalization effects at the level of prose,
but few studies evaluated empirically or programmed generalization of the 
behavior under investigation.

The teacher in a classroom is probably in a much better position to
evaluate generalization effects than is an experimenter who is in the classroom
for only a short time. The teacher can determine empirically if a
behavioral change is lasting and if other behaviors are also changed. The
concept of generality is most important for the teacher in improving
classroom control. Reducing disruptive behavior for a 20-minute session does
help the teacher, but if the improvement is effective only for those 20 
minutes and the rest of the six to seven hours of classroom time remains 
chaos, the teacher has not been helped very much. Methods must not only 
be devised to program generalization procedures, but periodic observations 
must be made to evaluate these generalization effects. 


Summary 


In reviewing applied classroom research using the dimensions of an 
applied behavior analysis (Baer et al., 1968) as criteria for evaluation, I 
found that the studies in this area should be correctly classified as applied. 
The behaviors dealt with in these studies were of social importance and, in 
most cases, important to the educational goals of the subjects. 


When assessed on the criterion of behavioral, the literature on class- 
room research indicated that the investigators in several studies failed to 
determine measures of the reliability of their measurement techniques. The 
majority of the studies without reliability measures were
conducted mainly in special classroom settings. Many of the studies without
reliability measures were conducted during the early period of applied be- 
havioral research, when the research emphasis and the demands of the 
settings seemed to be on a demonstration of change rather than on sophisti- 
cation of measurement techniques. One aspect of behavioral measurement 
techniques that seems in need of investigation is that of teacher recording 
with reliability measures of these recordings. Teachers are in the school 
setting continuously and have an opportunity to observe a great deal more
than the average researcher. If techniques could be developed to insure


reliability of teacher recording, many additional applications of behavioral 
analysis could be carried out in the classroom setting. 


Evaluation of applied behavioral studies according to the analytic
criterion revealed that many studies either did not include control techniques 
or did not present data, which makes it difficult to assess the function, if any, 
of the variables used. Most of the studies in which the investigators analyzed 
their controlling variables used a reversal technique of the ABA or ABAB 
design. Two studies were reported in which control was demonstrated
using a multiple baseline technique. It appears that in situations where 
the use of a reversal technique is not satisfactory because of situational 
constraints or factors such as the importance of the behavioral change, the 
multiple baseline technique would prove to be a valuable tool for the 
researcher in classroom settings.

Another aspect of analysis which is critically lacking in classroom
research so far is that of component and parametric analysis of the various
stimuli involved in the modification of behavior. These analyses may yet be
made, as the discipline of applied behavioral analysis becomes more firmly 
involved in classroom settings. 

A number of the studies cited failed to meet the technological criterion 


investigators. This conclusion must remain at the level of inference until 
replications are carried out which are clearly the result of information 


In general, all of the studies reviewed appeared to relate technological 
description to basic concepts of the science of behavior. Most studies would 


strated that the responses of children can be shaped to be less disruptive
and more studious, but there are very few data in classroom to 


The last evaluative criterion discussed was generality. It appears that 
generality still has a low priority in classroom research. This is most un- 
fortunate because generality of the students’ behavior can be assessed 


the generality of the teacher’s behavior. Applied behavioral research in the 
classroom involves not only an attempt to change the behavior of a pupil 
or a group of pupils, but at the same time an attempt is made to alter the 
teacher’s behavior in interacting with her students. If it can be programmed 
or demonstrated that the teacher actually uses newly acquired techniques on 
different children and in different situations other than those under investi- 
gation, a greater generality of the processes in an applied behavior study 


marks that teachers appear to use the experimental techniques in new 
situations, but very little data are available to analyze this anecdotal 
generality.

The hypothetical typical teacher’s behavior seems in a very general
way to approximate most of the criteria of an applied behavior analysis. It
appears that a closer congruity between these criteria and the teacher's 
behavior analysis of classroom behaviors. However, if the teacher is able to
better approximate the criteria of an applied behavior analysis, she is then 
in a much better position to apply behavioral techniques effectively to a 
greater variety of classroom problems, and she will be able to evaluate the 
techniques that are used. 
EXPERIMENTAL FACTORS RELATED TO
APTITUDE-TREATMENT INTERACTIONS 


Glenn H. Bracht¹
Southern Illinois University 


Bloom (1968), Cronbach (1957, 1967), Gagné (1967), 
Glaser (1967), Jensen (1967, 1968) and other educational psychologists 
have suggested that no single instructional process provides optimal learning 
for all students. Given a common set of objectives, some students will be 
more successful with one instructional program and other students will be 
more successful with an alternative instructional program. Consequently, a 
greater proportion of students will attain the instructional objectives when 
instruction is differentiated for different types of students. 

Glaser (1967) and others pointed out that psychologists have been too 
optimistic in their expectations of formulating general laws of learning and 
have not given sufficient attention to individual differences. In his APA 
presidential address, Cronbach (1957) encouraged psychologists in the 
experimental and correlational disciplines to combine their interests and 
methods to observe experimental effects for subjects of different character- 
istics and to conduct investigations to find aptitude-treatment interactions 
(ATIs). The goal of research on ATI is to find significant disordinal inter- 
actions between alternative treatments and personological variables, i.e., to 
develop alternative instructional programs so that optimal educational pay- 
off is obtained when students are assigned differently to the alternative 
programs. The personological variable in ATI research is defined as any 
measure of individual characteristics, e.g., IQ, scientific interest, or anxiety. 

Although there is an increasing interest in the topic of ATI among 
educational psychologists, very little empirical evidence has been provided 
to support the concept. So few experiments have shown a significant educa- 
tional payoff when students were given differential instruction that Gage 
and Unruh (1967, p. 368) were led to ask: 

. . . why has not the evidence from attempts to individualize
instruction yielded more dramatic results? Why are not the mean
scores on achievement measures of pupils taught with due respect
for their individual needs and abilities substantially higher, in
unmistakable ways, than those of students taught in the conventional
classroom, where everyone reads the same book, listens to the same
lecture, participates in the same classroom discussion, moves at the
same pace, and works at the same problems? For the fact is that,
despite several decades of concern with individualization, few if
any striking results have been reported.


¹This article is based on the author’s doctoral dissertation at the University of
Colorado. The author is grateful to Professor Kenneth D. Hopkins for his contributions to
the development of the study and suggestions during the course of the project.


To search for an explanation for the paucity of experimental findings 
showing disordinal interactions between alternative treatments and person- 
ological variables, Bracht (1969) conducted a systematic analysis of research 
studies to investigate the relationship of treatment tasks, personological 
variables and dependent variables to the occurrence of ATI. 


A systematic analysis was made of 90 research studies which were 
designed to permit a test of ATI. Two criteria were used for selecting studies: 
(a) the study had to include a comparison of two or more alternative 
treatments for attaining a common set of objectives and (b) the study had 
to include one or more personological variables so the comparison between 
alternative treatments could be made for subjects at different levels of 
the personological variable. In most of the studies the experimenter used 
a treatments-by-levels factorial design with analysis of variance. The studies 
for the analysis were found in a variety of journals, books, project reports
and unpublished papers. A tabulation of the sources for these studies is
reported in Table 1.


For each of the 90 studies, the treatment tasks, personological variable, 
dependent variable and interaction effect were classified into one of two 
categories. The treatment tasks were classified by the degree of controlled 
versus uncontrolled presentation of treatment stimuli. In controlled treat- 
ments the degree of attainment of the treatment objectives was largely 
controlled by the presentation of specific and prescribed treatment tasks, 
and little opportunity existed for the subjects to be influenced by other
external conditions. The treatment tasks were classified as uncontrolled if 
the degree of attainment of the treatment objectives was potentially in- 
fluenced by the presentation of a great variety of treatment stimuli or by 
external conditions which were not controlled in the experiment. 


The personological variable was classified as a factorially simple versus 
a factorially complex measure within the domain of cognitive variables. The 
investigator imagined a large matrix of factor loadings in which the variables
were measures of cognitive ability and achievement and the factors
represented general abilities, specific abilities, general achievement and 
achievement in more specific content and skill areas. The personological 
variable was classified as factorially simple if it was judged to have a
substantial loading on only a few factors in the matrix. Measures of specific
abilities, interests, attitudes, personality traits, and social, economic and
educational status were classified as factorially simple. The personological
variable was classified as factorially complex if it was judged to have a
substantial loading on many factors in the matrix. Measures of general
ability and achievement in educational programs were classified as factorially
complex.


Table 1
Tabulation of the Sources of the 90 Research Studies

Columns: Source; Frequency of studies with ordinal or no interaction;
Frequency of studies with disordinal interaction

Journal of Educational Psychology
American Educational Research Journal
Project reports
Journal of Educational Research
Dissertation Abstracts
AERA annual meeting
Journal of Social Psychology
Journal of Experimental Education
Unpublished papers
Book
Journal of Abnormal & Social Psychology
Psychological Reports
Programed Learning & Educational Technology
Journal of Applied Psychology
Merrill-Palmer Quarterly of Behavior and Development
British Journal of Educational Psychology
Science Education
Elementary School Journal
School Review
Educational & Psychological Measurement
Journal of Experimental Psychology
Total

The dependent variable was classified as a specific versus a general
measure of the treatment objectives. A specific dependent variable is intended
to measure the specific objectives of the treatments rather than general
objectives which may not be highly relevant to the treatment tasks. A
general dependent variable is intended to assess general objectives rather
than the specific objectives of the treatments and may include the measurement
of some objectives which are not relevant to the treatment and exclude
other objectives which are relevant. Most standardized achievement tests are
general measures because they assess general objectives which are common
to most curricula in the nation and exclude the measurement of
specific objectives relevant to a particular instructional treatment.


Fig. 1. Significant interaction effect between two levels of ability and two alternative
instructional programs. Reported by Krumboltz and Yabroff (1965).

Fig. 2. Significant interaction effect between two levels of previous achievement and
two alternative instructional programs. Reported by Entwisle, Huggins and Phelps (1968).


The empirical results of each study were classified according to the
type of interaction effect between the alternative treatments and the
personological variable. Lubin (1961) made a distinction between two types of
significant interaction effect with the analysis of variance. With reference
to the graph of cell means, a significant interaction effect is ordinal if
the treatment lines do not cross (cf. Figure 1) and disordinal if the
treatment lines do cross (cf. Figure 2). A number of experimenters have interpreted
the crossing of the treatment lines as evidence for ATI, i.e., students should
be assigned differentially to alternative treatments to obtain optimal
educational payoff. However, Lubin’s distinction does not provide adequate
protection against a Type I error in research on ATI. The crossing of the
treatment lines is not a sufficient requirement for the existence of
ATI.


The procedure used to test for ATI is outlined in Figure 3. When the
interaction was significant, a graph of cell means was made with the levels
of the personological variable on the horizontal axis and the dependent
variable on the vertical axis. If the treatment differences at two levels of
the personological variable were both significantly non-zero and different
in algebraic sign, the interaction was judged disordinal and evidence was
found for ATI.


Multiple t tests were made between treatment means when
the interaction effect was significant and the treatment lines were
crossed. However, the suggested approach is more conservative than the
procedure currently used by many experimenters and gives more protection
against a Type I error.



Fig. 3. Flow chart of procedure for testing an aptitude-treatment interaction in a 
treatments-by-levels factorial design. 
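
The decision rule charted in Fig. 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's own procedure: the `Cell` record, the function names, the pooled-variance form of the t statistic and the caller-supplied critical value `t_crit` are illustrative choices that do not appear in the article.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    """Summary statistics for one treatment at one level (hypothetical record)."""
    mean: float
    sd: float
    n: int


def t_statistic(a: Cell, b: Cell) -> float:
    """Pooled-variance two-sample t statistic for mean(a) - mean(b)."""
    df = a.n + b.n - 2
    pooled_var = ((a.n - 1) * a.sd ** 2 + (b.n - 1) * b.sd ** 2) / df
    se = math.sqrt(pooled_var * (1 / a.n + 1 / b.n))
    return (a.mean - b.mean) / se


def classify_interaction(interaction_significant, levels, t_crit):
    """Classify an interaction as nonsignificant, ordinal or disordinal.

    levels: list of (Cell for treatment A, Cell for treatment B), one pair
            per level of the personological variable.
    t_crit: two-tailed critical t value chosen by the caller (assumption).
    """
    if not interaction_significant:
        return "nonsignificant"
    ts = [t_statistic(a, b) for a, b in levels]
    favors_a = any(t > t_crit for t in ts)    # A significantly better somewhere
    favors_b = any(t < -t_crit for t in ts)   # B significantly better somewhere
    # Disordinal only if significant differences of opposite sign both occur;
    # lines that merely cross without both differences being significant
    # are still counted as ordinal.
    return "disordinal" if favors_a and favors_b else "ordinal"


low = (Cell(10, 2, 20), Cell(14, 2, 20))
high = (Cell(15, 2, 20), Cell(11, 2, 20))
print(classify_interaction(True, [low, high], t_crit=2.02))  # disordinal
```

Note that a pair of lines can cross on the graph and still be classified as ordinal here, which matches the article's point that crossing alone is not sufficient evidence for ATI.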
In several research studies the results were analyzed by regression
techniques rather than by the factorial design with analysis of variance.
Since the personological variable was not divided into levels, t tests could 
not be computed for treatment differences within specific levels of the 
personological variable. If the regression lines were crossed for the alternative 
treatments, the investigator subjectively judged the likelihood of the oc- 
currence of significant treatment differences within levels of the person- 
ological variable. 
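
For the regression case, the first question is whether the two fitted lines cross inside the observed range of the personological variable. The sketch below is only an illustration of that check, assuming one predictor and two treatments; the function names are hypothetical, and the article's own judgment of significance within levels was subjective and is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def crossing_point(line_a, line_b):
    """x value where two (intercept, slope) lines meet, or None if parallel."""
    (a0, a1), (b0, b1) = line_a, line_b
    if a1 == b1:
        return None
    return (b0 - a0) / (a1 - b1)


def crosses_within_range(xs_a, ys_a, xs_b, ys_b):
    """True if the two regression lines cross inside the observed x range."""
    x_star = crossing_point(fit_line(xs_a, ys_a), fit_line(xs_b, ys_b))
    if x_star is None:
        return False
    lowest = min(min(xs_a), min(xs_b))
    highest = max(max(xs_a), max(xs_b))
    return lowest <= x_star <= highest
```

A crossing inside the range is only a necessary condition for a disordinal interaction; whether the treatment differences on either side of the crossing are significant still has to be tested.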

Included in the “other” classification were ordinal interactions and 
nonsignificant interactions. Ordinal and nonsignificant effects were com- 
bined for this analysis because neither provides grounds for differentiating 
instruction for subjects of different characteristics. One treatment can be 
prescribed for all subjects; two or more alternative treatments are equally 
effective for all subjects. Other factors, such as cost and management of the 
alternative treatments, were not considered in classifying the interaction as 
disordinal, ordinal or nonsignificant. 

Although most studies of ATI used the treatments-by-levels fac- 
torial design with analysis of variance, it is generally a less powerful 
technique than regression analysis. Creating artificial levels of a continuous 
personological variable tends to increase the error component in the analysis. 
Regression analysis, however, was used in only a few ATI studies. 
Used even less frequently is the Johnson-Neyman technique (Johnson & 
Neyman, 1936; Johnson & Jackson, 1959) which has special application to 
ATI research. When the experimenter rejects the hypothesis of homogeneous 
regression lines in the treatment groups, the Johnson-Neyman technique can 
be used to define the regions of the predictor space (personological variables) 
in which the treatments are significantly different on the criterion. Thus, 
the Johnson-Neyman technique is used to test for ordinal versus disordinal 
interactions between the treatments and personological variables. Although 
the technique was originally developed for a design with two treatments and
two personological variables, it has been extended to the case of more than
two treatments and more than two personological variables (Potthoff, 1964;
Abelson, 1953).
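
To make the idea of a region of significance concrete, the following is a simplified sketch of the Johnson-Neyman computation for the special case of two treatments and a single personological variable. It is not the two-predictor form described above. It assumes separate least-squares lines per treatment, a pooled residual variance with n1 + n2 - 4 degrees of freedom, and a critical t value `t_crit` supplied by the caller; all function and key names are hypothetical.

```python
import math


def fit_group(xs, ys):
    """OLS fit for one treatment group; returns the summary terms needed below."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return {"n": n, "x_bar": x_bar, "sxx": sxx,
            "intercept": intercept, "slope": slope, "sse": sse}


def johnson_neyman_bounds(g1, g2, t_crit):
    """Solve diff(x)^2 = t_crit^2 * Var(diff(x)) for x.

    diff(x) is the difference between the two fitted lines at x. Returns the
    two boundary values (low, high), or None if no real boundaries exist.
    When the leading coefficient is positive, the treatments differ
    significantly outside the interval and not inside it.
    """
    s2 = (g1["sse"] + g2["sse"]) / (g1["n"] + g2["n"] - 4)  # pooled residual variance
    d0 = g1["intercept"] - g2["intercept"]
    d1 = g1["slope"] - g2["slope"]
    k = t_crit ** 2 * s2
    a = d1 ** 2 - k * (1 / g1["sxx"] + 1 / g2["sxx"])
    b = 2 * d0 * d1 + 2 * k * (g1["x_bar"] / g1["sxx"] + g2["x_bar"] / g2["sxx"])
    c = d0 ** 2 - k * (1 / g1["n"] + 1 / g2["n"]
                       + g1["x_bar"] ** 2 / g1["sxx"]
                       + g2["x_bar"] ** 2 / g2["sxx"])
    disc = b * b - 4 * a * c
    if a == 0 or disc < 0:
        return None
    root = math.sqrt(disc)
    return tuple(sorted(((-b - root) / (2 * a), (-b + root) / (2 * a))))
```

If both boundaries fall inside the observed range of the personological variable and the leading coefficient is positive, each treatment is significantly better on one side of the interval, which corresponds to a disordinal interaction; if only one region of significance lies within the observed range, the interaction is ordinal.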


Results 


A number of studies included multiple personological variables or 


information about the classifications of the treatment tasks,
personological variable, dependent variable, and the interaction effect

studies in each of the eight classifications; only the frequency of studies is
reported in Table 2. Since the quality of the studies varies extensively, 
especially because many of the studies were not designed with the primary 
intent of testing for an ATI, the general quality of research may also vary 
across the eight classifications. 


Table 2


Tabulation of Studies with Disordinal versus Ordinal or Nonsignificant Interactions
for Each Combination of Classifications of Treatment Tasks,
Personological Variable, and Dependent Variable


Treatment      Personological        Dependent   Ordinal or nonsignificant   Disordinal
tasks          variable              variable    interaction, f (%)          interaction, f (%)

controlled     factorially simple    specific      2 (2)                     4 (4)
controlled     factorially simple    general       [illegible]               0 (0)
controlled     factorially complex   specific     56 (52)                    1 (1)
controlled     factorially complex   general       [illegible]               0 (0)
uncontrolled   factorially simple    specific      5 (5)                     0 (0)
uncontrolled   factorially simple    general       [illegible]               0 (0)
uncontrolled   factorially complex   specific      [illegible]               0 (0)
uncontrolled   factorially complex   general       [illegible]               0 (0)

Total                                            103 (95)                    5 (5)


It is not known to what extent the relative frequency of studies reported
for each classification in Table 2 is representative of the population of ATI
studies. Perhaps a large number of studies in one or more classifications
were never written for dissemination or were not brought to the attention
of the investigator. Hence the results are primarily descriptive and have
limited inferential value.
The results of the five experiments with disordinal interactions are
briefly reported below: 

1. In the experiment by Atkinson and Reitman (1956), Ss low on 
affiliation motive performed better with the achievement orientation treat- 
ment (p<.10), and Ss high on affiliation motive performed better with the 
multi-incentive treatment (p<.05). 

2. Hovland, Lumsdaine and Sheffield (1949) found that presenting 
one side of an issue was more effective for changing the opinion of the men 
who initially favored the opinion of the program (p<.05), but presenting 
both sides of an issue was more effective for changing the opinion of the 
men who initially opposed the opinion of the program (p<.05). 

3. In the experiment by Marshall (1969), Ss from poor educational 
environments performed better on the high-interest task (p<.05) and Ss 
from good educational environments performed better on the low-interest 
task (p<.05). 

4. Thompson and Hunnicutt (1944) reported that introverts obtained 
higher cancellation scores when they received praise (p<.05) and extroverts 
obtained higher cancellation scores when they received blame (p<.01). 

5. Van De Riet (1964) found that underachievers performed better 
when they received reproof (p<.01) and normal achievers performed better 
when they received praise or were asked unrelated questions (p<.01). 


Treatment Tasks 


It was hypothesized that disordinal interactions were more likely to be
found when the presentation of the treatment stimuli was controlled by the 
experimenter. When a variety of treatment stimuli, especially conditions 
not controlled by the experimenter, are able to influence performance on the 
dependent variable, it is unlikely that a personological variable can be found 
to produce a disordinal interaction with the alternative treatments. As the 
number of different types of treatment tasks increases, the number of differ- 
ential abilities which contribute to successful performance also increases. 
Success on a combination of heterogeneous treatment tasks is predicted best 
by measures of general ability, and the degree of prediction is about equally 
high for alternative treatments. 

The results lend support to the hypothesis about the relationship be- 
tween disordinal interactions and the degree of control over the treatment 
tasks of the experiment. The five disordinal interactions were obtained in 
experiments with controlled treatment tasks. The treatment stimuli were 
clearly specified; relatively few different tasks were performed; and the 
probability that external conditions not prescribed for the treatments could 
affect the attainment of objectives was low. In addition, the 103 combina- 
tions with ordinal or nonsignificant interactions were investigated to find 
studies in which disordinal interactions would most likely occur. In 14 
studies there were ordinal interactions with a crossing of the treatment line. 
With some modification of the treatments or personological variables in 
these studies, a replication of the experiments may produce evidence for ATI. 
In 13 of the studies the treatment tasks were classified as controlled. 
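
Whether two treatment lines cross inside the observed range of the personological variable can be checked with simple arithmetic. The Python sketch below is illustrative only. It assumes that the outcome under each treatment has been summarized by a straight regression line (an intercept and a slope); the function names and the numbers are hypothetical and are not taken from any study reviewed here. A crossing within the range does not by itself establish a disordinal interaction; in the five experiments listed above, a treatment difference was also reported in each direction.

```python
def crossover_point(intercept_a, slope_a, intercept_b, slope_b):
    """Aptitude score at which two straight treatment lines meet (None if parallel)."""
    if slope_a == slope_b:
        return None
    return (intercept_b - intercept_a) / (slope_a - slope_b)


def crosses_within_range(x_cross, lowest_score, highest_score):
    """True when the lines cross inside the observed range of the personological variable."""
    return x_cross is not None and lowest_score <= x_cross <= highest_score


# Hypothetical regression lines: predicted outcome = intercept + slope * aptitude
x_cross = crossover_point(20.0, 0.5, 35.0, 0.25)
inside = crosses_within_range(x_cross, 40.0, 80.0)
print(x_cross, inside)
```

With these made-up lines the crossing falls at an aptitude score of 60, inside the assumed range of 40 to 80.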

However, a controlled presentation of treatment stimuli is not a guar- 
antee of a disordinal interaction. In 85 of the 108 combinations, the 
treatment tasks were classified as controlled. Of these 85 combinations, 
disordinal interactions were found for 5 combinations and ordinal or non- 
significant interactions were found for 80 combinations (cf. Table 2). Con- 
trolled treatment tasks may be necessary but certainly are not a sufficient 
requirement for disordinal interactions. 
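
The counts in the two preceding paragraphs can be laid out as a small two-by-two tally. The sketch below is only a bookkeeping aid: the figures 108, 85, and 5 come from the text, while the "not controlled" row is derived here by subtraction and is an assumption in that sense, since it is not reported directly in this passage.

```python
TOTAL_COMBINATIONS = 108
CONTROLLED = 85
DISORDINAL_CONTROLLED = 5   # all five disordinal interactions involved controlled tasks

# Derived by subtraction; not stated directly in the text.
not_controlled = TOTAL_COMBINATIONS - CONTROLLED

table = {
    "controlled": {
        "disordinal": DISORDINAL_CONTROLLED,
        "ordinal or nonsignificant": CONTROLLED - DISORDINAL_CONTROLLED,
    },
    "not controlled": {
        "disordinal": 0,
        "ordinal or nonsignificant": not_controlled,
    },
}

for label, row in table.items():
    n = sum(row.values())
    rate = row["disordinal"] / n
    print(f"{label:15s} n={n:3d}  disordinal={row['disordinal']}  rate={rate:.3f}")
```

Even among the controlled combinations, the proportion with a disordinal interaction is 5 in 85, which is why control is described as possibly necessary but not sufficient.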

The degree of task complexity may be a major factor in the occurrence 
of ATI. Although the treatment tasks for most of the 90 studies were classi- 
fied as controlled, the treatments generally were relatively complex tasks. 
Conversely, four of the five experiments with disordinal interactions were 
more similar to the basic learning tasks of the research laboratory. 

An important factor in research on ATI may be the degree to which 
the alternative treatments are interesting to the subjects. Marshall (1969) 
found a disordinal interaction when the alternative treatments were a high- 
interest task and a low-interest task. In the experiment by Kress and Grop- 
per (1966), there was a trend for characteristically fast workers to perform 
better under a fast fixed-tempo condition and characteristically slow work- 
ers to perform better under a slow fixed-tempo condition. These findings 
may be related to the differential interest of the two types of student in the 
alternative treatments. 


Personological Variables 


It was hypothesized that disordinal interactions between alternative 
treatments and levels of a personological variable are more likely to occur 
with factorially simple personological variables than with factorially complex 
personological variables. Measures of general ability and achievement corre- 
late substantially with performance in most complex learning tasks and are 
not likely to correlate differentially with achievement in alternative treat- 
ments. Variables with imputed factorial complexity have relatively high 
correlations with most measures of achievement in complex cognitive tasks. 
Factorially simple measures have substantial correlations with performance 
in only selected cognitive tasks or have relatively low correlations with most 
complex cognitive achievement. Given variables with imputed factorial 
simplicity which correlate substantially with performance in only selected 
cognitive tasks, it seems that alternative treatments can be developed to 
capitalize differentially on specific abilities and personality traits. 

The results lend some support to the hypothesis about the relationship 
between ATI and the degree of factorial simplicity of the personological 
variable. Of the five disordinal interactions, four were obtained with factorially 
simple personological variables: (a) need for affiliation, (b) a pretest 
of opinion, (c) educational aspects of the social environment and (d) an 
introversion-extroversion scale. In the fifth experiment (Van De Riet, 
1964), the personological variable consisted of groups of underachievers and 
normal achievers formed on the basis of scores from tests of general ability 
and previous achievement. Although the original personological variables 
were factorially complex, the underachievement-overachievement variable 
may be factorially simple. In the 14 studies with ordinal interactions and 
crossed treatment lines, 6 contained personological variables with imputed 
factorial simplicity and 8 contained personological variables with imputed 
factorial complexity. 

Personological variables with imputed factorial simplicity certainly are 
not a sufficient requirement for ATI. In 34 of the 108 combinations, the 
personological variables were classified as factorially simple. Of these 34 
combinations, disordinal interactions were found for 4 combinations and 
ordinal or nonsignificant interactions were found for 30 combinations 
(cf. Table 2). 

Of the 74 combinations in which the personological variable was classi- 
fied as factorially complex, the intelligence test was used most frequently. 
Other cognitive variables were achievement tests and tests of specific abilities 
from the Differential Aptitude Tests, the Primary Mental Abilities battery, 
the ETS Kit of Reference Tests, and Guilford’s Structure-of-Intellect model. 
Interest inventories, personality measures and measures of social, economic 
and educational environment were employed less frequently. Despite the 
large number of comparative experiments with intelligence as a personologi- 
cal variable, no evidence was found to suggest that the IQ score and similar 
measures of general ability are useful variables for differentiating alternative 
treatments for subjects in a homogeneous age group. These measures corre- 
late substantially with achievement in most school-related tasks and hence 
are not likely to correlate differentially with performance in alternative 
treatments of complex achievement-oriented tasks. 


Dependent Variable 


It was hypothesized that ATI is more likely to occur with specific 
dependent variables than with general dependent variables. Since most of 
the treatments consisted of basic learning tasks or relatively short self- 
instructional materials, the dependent variable in most studies was a meas- 
ure of performance during the treatment or a posttest constructed for the 
experiment. In 100 combinations, including the 5 experiments with disor- 
dinal interactions, the dependent variable was a specific measure of the 
treatment objectives (cf. Table 2). Consequently, very little was learned 
from the analysis about the relationship between the dependent variable 
and the occurrence of ATI. 


Discussion 


In experiments on ATI, the experimenters usually identified alter- 
native treatments and then through trial-and-error tried to find personologi- 
cal variables to interact with the treatments. The analysis of an interaction 
effect was often an afterthought rather than a carefully planned part of the 
experiment, i.e., the alternative treatments were not developed with the ATI 
concept in mind. This approach has not been successful for finding mean- 


nature of the alternative treatments and the selection of 

variables. To be differentially effective for various types of students, the 
alternative treatments should demand different abilities for successful per- 
formance. Cronbach and Snow (1969) recommended an analysis of the 
processes performed by the subjects when they were studying a specified 
topic. After the processes were identified, judgments were made about the 
ability (ability A) which was related most significantly to the successful 
performance of the processes (treatment A). Then an attempt was made to 
develop an alternative treatment in which different processes were 

to attain the same instructional objectives. The ability (ability B) to per- 
form the second set of processes (treatment B) was unrelated or only mod- 
erately related to the ability to perform the original set of processes. Hence, 
students with higher ability A than ability B were expected to perform better 
with treatment A, and students with higher ability B were expected to 
perform better with treatment B. In many studies, the alternative treatment 
was only some minor modification of some original program. 
Experimenters need to move beyond this level and develop alternative 
treatments from a conception of the abilities which are relevant to successful 
performance in the alternative treatments. 


dence that the verbal treatment was superior for Ss with high verbal ability 
and the spatial treatment was superior for Ss with low verbal ability. 

Cronbach and Snow (1969) recommended the development of alternative 
treatments on the basis of general ability. One treatment (treatment A) 
should be designed to rely heavily on general ability, and the other treat- 
ment (treatment B) should be designed to attain the same objectives without 
relying on general ability. Low-ability students should be more successful 
with treatment B and high-ability students should be more successful with 
treatment A. In the 14 studies with ordinal interactions and crossed treat- 
ment lines, the personological variable was general ability or achievement 
in 8 studies. Thus, research with alternative treatments which rely differ- 
entially on general ability may provide evidence for the occurrence of ATI. 


Experimenters should begin to formulate hypotheses about ATI with 
administrative factors, such as cost, in mind. For example, suppose treatments 
A and B cost $3.00 and $5.00, respectively, per student. If low-ability 
students perform significantly better on B and middle- and high-ability 
students do equally well on both, the following decisions may be made: 
(a) Give treatment A to the middle- and high-ability students and (b) Give 
treatment B to the low-ability students. Hence, ordinal interactions may 
lead to decisions about differential assignment of students to treatments 
when administrative factors are taken into account. 
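
The cost example above can be written out as a short decision rule. This is a minimal sketch under stated assumptions: the per-student costs are those given in the text, but the three ability labels, the function names, and the group sizes are hypothetical and serve only to show the arithmetic.

```python
COST_PER_STUDENT = {"A": 3.00, "B": 5.00}   # dollars, as in the example in the text


def assign_treatment(ability_level):
    """Low-ability students get B (they do better on it); others get the cheaper A."""
    return "B" if ability_level == "low" else "A"


def total_cost(group_sizes):
    """Total cost when each ability group receives its assigned treatment."""
    return sum(COST_PER_STUDENT[assign_treatment(level)] * n
               for level, n in group_sizes.items())


# Hypothetical group sizes, for illustration only.
group_sizes = {"low": 30, "middle": 40, "high": 30}
differential = total_cost(group_sizes)
everyone_on_b = COST_PER_STUDENT["B"] * sum(group_sizes.values())
print(differential, everyone_on_b)
```

With these made-up group sizes, differential assignment costs $360 against $500 for giving treatment B to everyone, while no group is placed on a treatment on which it performs worse.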


The real test for the concept of ATI will come as more experimenters 
use process analysis for developing alternative treatments. One of the most 
significant contributions to the topic is the excellent project report by Cron- 
bach and Snow (1969). If aptitude-treatment interactions exist, experi- 
mentation and a continuing dialogue among educational psychologists 
should soon help to identify the salient treatment differences and person- 
ological variables which are relevant to the occurrence of ATI. 


This 

teacher effectiveness. In the review the term “teacher effects” refers to 
residual class mean achievement scores 

or student aptitude is used to adjust posttest 

there have been numerous studies seeking to determine the 
teacher effects within a given period, much less effort has been devoted 
to assessing the consistency of effects across two intervals (two 
lessons or two years). This paper reviews and discusses nine studies 
concerned with the generality of teacher effects across two instructional periods. 
Nine studies are too few to support any conclusions, especially since the 
designs of the studies were not equally rigorous. Therefore, one of the 
purposes of writing the review is to call attention 

in this area, and to the problems in interpreting the results. It is hoped that 
future studies will be conducted to replicate the existing studies and to help 
resolve some of the issues which they raise. 

The stability of teacher effects might be measured in three situations. 
The most common situation is one in which a teacher teaches the same 
material to different students. The different students could be taught 
concurrently (as in high-school physics classes) or across a number of 
years (as in fifth-grade, self-contained classes). The other situations 
would be those in which the teacher taught different material to the same 
students (e.g., two units on American history to the same class) or taught 
different material to different students (e.g., two units on American history 
to different classes). 

In all three situations the research question is whether a teacher who 
is effective or ineffective once is equally effective or ineffective a second 
time. Currently the most interesting question 


1The first draft of this paper was written while the 
Helpful eas on a first draft were provided by 
ck Hiller, B: a Rosenshine, Frank Sobol, Herbert 
Mich the forel typewriter. The correlations on the CRAFT project are not = 
published report, but they were made availa 
Their help is gratefully acknowledged. 
effects when they use the same material but instruct different students, 
because this question has the greatest generalizability to the effectiveness 
of teachers in specific instructional programs. The results of such studies are 
also easiest to interpret because the same instructional materials and tests 
are used twice; in the other two situations, different tests and different in- 
structional materials are used, and such differences confound the measure- 
ment. 


The studies of the stability of teacher effects might also be divided 
into “long-term” studies in which the period of instruction lasts for a month 
or a school year, and “short-term” studies in which the period of instruction 
is 30 minutes or less. Although it is possible to conduct three types of 
long-term and three types of short-term studies, I was able to locate only 
four types of studies: (a) long-term studies in which teachers taught the 
same material to different students, (b) short-term studies in which teachers 
taught the same material to different students, (c) short-term studies in 
which teachers taught different material to the same students, and (d) short- 
term studies in which teachers taught different material to different students. 


Long-Term Studies of Teacher Consistency 


Long-term studies are presented first and given the greatest importance 
because the teaching situation is most generalizable to “natural teaching” 
as currently practiced. Specific results were obtained in four long-term 
studies (Morsh, Burgess & Smith, 1955; Harris et al., 1st grade, 1968; Harris 
et al., 2nd grade, 1968; Soar, 1966), and descriptive statements were 
obtained in the fifth study (Torrance & Parent, 1966) (see Table 1). In 
all studies but one (Morsh et al., 1955), the period of instruction was a 
school year. The evidence on the stability of teacher effects was a by-product 
of the research project; no long-term study was found which 
focused specifically on this question. 


In the earliest study (Morsh et al., 1955), 106 Air Force instructors 
taught the same material on airplane hydraulics to two different classes a 
month apart. Each course lasted for eight daily one-hour sessions. A pretest 
similar to the written posttest and grades in previous phases of the program 
were used as predictor variables in the calculation of residual gain scores. 
The computational formula was a special one developed by Morsh which 
used between class but within instructor variances and covariances [p. 181.” 
The correlations between the residual gain scores on the written tests, 
performance tests, and combined written and performance tests across the two 
occasions ranged from .32 to .38 (p<.001). 
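
The general logic of a stability coefficient based on residual gain can be sketched in a few lines of Python. This is not the special computational formula developed by Morsh, which is not reproduced in this review. It is a simplified stand-in that assumes a single pretest predictor and ordinary least-squares regression on class means, and all names and scores below are hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residual_gains(pretest_means, posttest_means):
    """Posttest class means minus the values predicted from pretest class means."""
    mx, my = mean(pretest_means), mean(posttest_means)
    slope = (sum((a - mx) * (b - my) for a, b in zip(pretest_means, posttest_means))
             / sum((a - mx) ** 2 for a in pretest_means))
    intercept = my - slope * mx
    return [b - (intercept + slope * a)
            for a, b in zip(pretest_means, posttest_means)]


# Hypothetical class means for five teachers on two occasions (same teacher order).
pre_1, post_1 = [50, 55, 60, 65, 70], [58, 60, 70, 69, 80]
pre_2, post_2 = [52, 54, 61, 66, 69], [57, 63, 68, 74, 77]

# Stability coefficient: correlation of each teacher's residual gain across occasions.
stability = pearson_r(residual_gains(pre_1, post_1), residual_gains(pre_2, post_2))
print(round(stability, 2))
```

A teacher's residual gain is positive when the class scored above what its pretest would predict, and the stability coefficient asks whether the same teachers are above or below prediction on both occasions.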


In the CRAFT project (Harris & Serwer, 1966; Harris et al., 1968), 
adjusted posttest scores were obtained for two independent samples 
of teachers: 30 teachers who taught similar first-grade classes in two 
successive years, and 24 teachers who taught similar second-grade classes in 
successive years. In each grade, posttest scores on various subtests in reading 
were adjusted by using all pretest scores as covariates. In the first grade 
(see Table 1) there were correlations of about .5 for teacher effects across 
two years in word recognition and spelling, correlations of about .3 in word 
study and paragraph meaning, and no correlation on the vocabulary subtest. 
However, none of the correlations was above .26 for the sample of 24 
second-grade teachers, and in reading, the area of greatest emphasis, the 
stability coefficient was -.077. 

Soar (1966) studied teacher classroom behavior, student classroom 
behavior, and student achievement in 55 classrooms (grades three through 
six) across two successive years. Following the procedure developed by Lord 
(1962), Soar estimated student residual gain for each of four subtests of 
the Iowa Tests of Educational Development: vocabulary, reading, arithmetic 
concepts, and arithmetic problems. Pretest and posttest measures were 
available for all subtests. Through a factor analysis of both the process and 
product measures, one factor was identified as “achievement gain” because 
it contained high loadings for pupil gain on the four achievement measures, 
and was almost barren of observed teacher or student classroom behaviors 
(Soar, 1966, p. 254). The stability, or correlation, of the factor “achieve- 
ment gain” across the two years was .09. 

In the fifth study (Torrance & Parent, 1966), 63 teachers were 
studied during two successive years while they taught the School Mathema- 
tics Study Group materials to classes ranging from seventh through twelfth 
grade (students 13-18 years old). Residual gain scores were calculated for 
each teacher, and those teachers in the upper and lower thirds of the 
distribution for each year were selected for further study. Correlations 
between teacher effects across the two years were not given, but the authors 
stated: 


. . . there was some but not a high degree of consistency between 
those in the ‘Most Effective’ and those in the ‘Least Effective’ 
criterion groups for the two years. In fact, a few teachers in the 
‘Most Effective’ group in 1960-61 were in the ‘Least Effective’ 
group in 1961-62 and vice versa [Torrance & Parent, 1966, p. 
ads 


These five studies were the only long-term studies on the stability 
of teacher effects that I was able to locate. Five studies hardly form a suf- 
ficient basis for generalization, even though the studies were widely discrep- 
ant in settings, subject areas, and students. In three studies the setting 
diminishes generalizability to public school instruction. Morsh et al. studied 

2This quotation contains a correction of a typographical error in the original text; 
the revision was made by E. Paul Torrance (personal correspondence). 

Table 1 
Stability Coefficients for Four Long-Term Studies 

Investigator, sample, and        Subject area and                 Stability coefficients 
length of instruction            criterion tests                  Test                       r 

Morsh et al., 1955;              Airplane hydraulics              Written test               [illegible] 
106 Air Force instructors;                                        Performance test           .32**** 
eight one-hour periods.                                           Combined written and 
                                                                  performance test           .38**** 

Harris et al., 1968;             Reading ([illegible],            Vocabulary                 [illegible] 
30 teachers, 1st grade;          Primary 1)                       Word reading               [illegible] 
one school year.                                                  Paragraph meaning          [illegible] 
                                                                  Spelling                   [illegible] 
                                                                  Word study skills          [illegible] 

Harris et al., 1968;             Reading (Metropolitan            Word knowledge             .19 
24 teachers, 2nd grade;          Achievement Tests,               Word discrimination        [illegible] 
one school year.                 Advanced Primary, Form C)        Reading                    [illegible] 
                                                                  Spelling                   .21 

Soar, 1966;                      Reading and arithmetic           Stability of factor 
55 teachers, 3rd through         (Iowa Tests of Basic Skills:     labeled “Achievement 
6th grade;                       Vocabulary, Reading,             Gain”                      .09 
one school year.                 Arithmetic Concepts, 
                                 Arithmetic Problems) 

Torrance & Parent, 1966;         Math (Sequential Tests of        Correlations not reported 
63 teachers, 7th through         Educational Progress) 
12th grade. 

*p<.10 
**p<.05 
***p<.01 
****p<.001 


Air Force instruction in hydraulics, and Harris et al. conducted experimental 
studies of four methods of teaching reading. Although Harris et al. did not 
find significant differences in student achievement among methods, the fact 
that the teachers were using new, specific reading instruction procedures 
limits the interpretation and generality of their two studies as examples 

In general, the term “effective teacher” has been taken to mean that a 
teacher remains effective across a number of years. Yet, on the basis of these 
studies, evidence on the consistency of teacher effects is weak because 
correlations as high as .5 were obtained in only one study (Harris et al., 1st 
grade, 1968), and all other correlations were about .35 or much lower. There 
is a need for further research to establish whether terms such as “effective 
teaching” or “ineffective teaching” have any stable meaning. 

Questions of teacher consistency might also cause one to reconsider 
the results of studies relating teacher behaviors to student achievement; 
there is a need for a review of the consistency of teacher 
behaviors across time. Present reviews suggest that some aspects of teacher 
classroom behavior are relatively stable, even across successive years (Soar, 
1966). If teachers are consistent in their behavior, but inconsistent in their 
effects, it is no wonder that research in this area has proved to be so difficult 
to pursue and so bewildering in its findings. Finally, the lack of high stability 
coefficients in teacher effects may explain why studies of teacher character- 
istics have proven so futile. Teacher characteristics such as aptitude, atti- 
tudes, marital status, years of education, and number of courses in a given 
field are relatively stable. If these stable characteristics are correlated with 
unstable residual gain measures, we should expect “correlations that 
are nonsignificant, inconsistent from one study to the next, and 
lacking in psychological and educational meaning [Gage, 1963, p. 118].” 


Threats to Internal Validity 


This discussion of potential threats to the internal validity of these 
studies has two purposes: to focus attention on some of the issues which 
might be considered in designing future studies in the area, and to provide 
some guidelines by which to compare the results of the long-term and short- 
term studies of teacher consistency. In the long-term studies, the greatest 
threats to internal validity appear to be nonrandom assignment of students 
to classrooms and inappropriate criterion tests. 

The residual gain scores obtained in these studies may be unreliable 
because there was no random assignment of students to classrooms. It is 
possible that assignment was “random in effect” (Lord, 1962, p. 38) in 
some or all of these studies, but none of the investigators discussed this 
possibility. Without random assignment, residual gain scores are confounded 
because there may be systematic differences between the classes on important 
variables which are not included as covariates (Elashoff, 1969; Cronbach 
& Furby, 1970). In studies of classroom instruction, such variables might 
be neighborhood characteristics, home environment, various aspects of the 
school environment such as student or teacher interest in achievement, 
scholastic aptitudes, or classroom interaction style. The possibility of syste- 
matic differences among classrooms on relevant variables is particularly 
likely in most of the studies in this review because classrooms from different 
schools were included in the same analysis. The use of classrooms from a 
number of schools was necessary in order to obtain a reasonably large 
sample, but random assignment of students to classrooms across schools 
was, and remains, extraordinarily difficult. Random assignment of students 
to classrooms within schools occurred in the studies by Harris et al., but 
although that randomization can control systematic variations within 
schools, it cannot control systematic differences among the 11 schools in- 
cluded in the study. 


The appropriateness of the criterion tests to the instructional procedures 
may be a second major threat to the internal validity of these studies. In all 
the long-term studies but one (Morsh et al., 1955), standardized achieve- 
ment tests were used as the criterion instruments. Such tests may be in- 
appropriate measures of the influence of teacher behavior because the items 
on the tests may not be relevant to the materials or skills taught in the 
classroom. In many cases, these tests may be measuring the aptitude of the 
learner or the pressure for academic achievement in the home rather than 
the influence of the teacher. A score on a standardized reading test may 
represent overall performance on 10 objectives, but one teacher may be 
concerned with only 2 of these and another teacher with 2 others (Klein, 

In one study (Torrance & Parent, 1966) the teachers used SMSG 
materials, but the criterion instrument was the STEP test in mathematics. 
Although it may be important to learn how students in an SMSG program 
perform on a STEP test, such a test does not appear to be the most appro- 
priate instrument to assess the stability of teacher effects in this situation. 
Even if an SMSG test were used, one would still have the problem of 
different teachers emphasizing different objectives. 


The potential lack of agreement among the instructional materials, the 
teacher’s aims and behaviors, and the criterion instruments poses an 
extremely difficult problem in studying classroom instruction. As one 
investigator noted, “If one group of teachers is teaching skills A, B, and C 
very well, and another group is teaching D, E, and F very poorly, what 
will be shown if their classes are tested on skills X, Y, and Z?” (G. 
Nuthall, personal communication). Even if criterion instruments are 
developed which provide separate scores on skills A through Z, it remains 
extremely difficult to measure the stability of teacher effects if different 
teachers emphasize different objectives. 


In the study by Morsh et al. on Air Force instruction, there seems to 
have been the greatest congruity between the teachers’ objectives and the 
criterion tests: there were highly specific course objectives, common 
training aids, and specific criterion tests. Yet even in such a situation, 
although the stability coefficients were highly significant (p<.001), they 
were relatively small (r’s=.32 to .38). One might claim that in such a 
prescribed situation there would be little variation in teacher effects, but such 
a claim cannot be verified because measures of dispersion of gain scores 
were not presented in the final report. However, there was sufficient variation 
in residual gain scores and student behavior in different classes to 
yield a multiple R of .59 between student inattentive behaviors (by class) 
and the class mean residual gain scores. But the Air Force study is the only 
long-term study I found in which such specificity existed. 


It may be possible to develop designs which permit random assignment 
of students to teachers across schools, and which focus the teachers’ atten- 
tion upon the criterion tests. Such designs would enable us to study whether 
teacher effects are stable under conditions designed for greater internal 
validity. Unfortunately, such situations will not be representative of the 
instructional world as it currently exists, and so the gains in internal validity 
will also represent losses in external validity. 


Aside from the measurement of gain scores and the selection of appro- 
priate criterion tests, there are other threats to internal validity which might 
be controlled in future studies: mortality, events between pretest and start 
of instruction, and the combining of scores from different grades. 

Mortality refers to the possibility that the students who remain in 
a class throughout a school year might be affected by the loss of some 
students and their replacement by others. Three investigators (Morsh et al., 
1955; Soar, 1966; Torrance & Parent, 1966) did not discuss mortality, 
and there was probably little attrition in the military or middle-class situations 
which they studied. In the studies by Harris et al., which dealt with 
disadvantaged urban children in the primary grades, 17% to 37% of the 
students who took the pretests left the school before the posttests were given. 
The larger loss occurred when first-grade posttests were used as pretests for 
the second grade. 

The use of pretests administered before the summer may also attenuate 
the stability coefficients either because of the different events which the 
students experienced during the summer (e.g., some may receive tutoring) 
or because the growth of the students during the summer may be differ- 
entially affected by the classroom behavior of the teacher of the previous 
year. In the study by Soar (1966), “indirect control” by the teacher during 
the academic year was found to produce more growth during the summer 
than extreme use of “direct control.” In future studies of this type, problems 
of mortality and the effects of the summer might be reduced by administering 
the pretests near the start of the instructional session. 


Test Reliability 


The internal consistency of the tests was probably quite high because 
all the investigators but one (Morsh et al., 1955) used standardized achieve- 
ment tests. Morsh and his associates developed their own tests and reported 
that the tests were revised until the investigators were satisfied with the 
statistically determined internal consistency. 

Two investigators reported the correlations between pretests and posttests 
(by individuals) for all grades in their samples and for each of two 
years (Harris et al., 1968; Soar, 1966). Together, these studies yielded test-retest 
correlations for grades one through six. In grade one the correlations 
ranged from .3 to .4; in grade two the correlations ranged from .6 to .7 
(Harris et al., 1968). This jump from grades one to two may be due to the 
use of reading tests as pretests for grade two, whereas reading readiness 
tests were used as pretests for grade one. The actual correlations between 
pretests and posttests which were used to calculate the residual gain scores 
were probably higher in the two studies by Harris et al. because all pretest 
scores were used in a multiple regression equation to calculate the gain 
scores. The test-retest correlations reported by Soar (1966) steadily increased 
from .5 to .7 in grade three, to .6 to .8 in grade six. In the studies by both 
Harris and Soar the correlations were very consistent for the two samples 
in each grade. 


Short-Term Studies of Teacher Consistency 


In short-term studies the total period of instruction is reduced, and in 
the studies presented below, no teacher taught longer than 30 minutes. A 
major advantage of these studies is that they can be designed to control 
for the threats to internal validity discussed above. In practice, each investi- 
gator controlled only for certain aspects of design, but as a group these 
studies contain a variety of ideas which might be incorporated into future 


The studies by Fortune are particularly important because the design,
procedures for selecting students, and methods for adjusting posttest scores
were the same in all three situations; any differences among the stability
coefficients across these situations cannot be easily dismissed. In these
studies each teacher taught Topic A to Group 1 and again to Group 2. Each
teacher also taught Topic B to Group 1 and Topic C to Group 2. Although
each teacher taught only four lessons, the design allowed for the calculation
of six correlation coefficients, each indicative of one of three forms of
teacher consistency (see Figure 1).
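The way four lessons yield six coefficients can be checked by enumeration. The sketch below pairs the four lessons and classifies each pair with the labels a, b, and c used in Figure 1; the function name `consistency_type` is hypothetical.

```python
from itertools import combinations

# The four lessons each teacher taught, as (topic, group) pairs.
lessons = [("A", 1), ("A", 2), ("B", 1), ("C", 2)]


def consistency_type(first, second):
    """Classify a pair of lessons using the a/b/c labels of Figure 1."""
    same_topic = first[0] == second[0]
    same_group = first[1] == second[1]
    if same_topic and not same_group:
        return "a"  # same material, different groups
    if same_group and not same_topic:
        return "b"  # different material, same group
    if not same_topic and not same_group:
        return "c"  # different material, different groups
    raise ValueError("same topic taught twice to the same group")


pairs = {(x, y): consistency_type(x, y) for x, y in combinations(lessons, 2)}
counts = {label: list(pairs.values()).count(label) for label in "abc"}
print(counts)  # {'a': 1, 'b': 2, 'c': 3}
```

The six pairs thus divide into one coefficient of type a, two of type b, and three of type c.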


In the first study (Fortune, 1966), the instructors were 30
teacher-trainees randomly selected from a group of 42 teacher-trainees enrolled
in an eight-week Head Start Training Program. Each of the teacher-trainees
worked at a different Head Start Center. The 30 trainees were randomly
divided into two groups. The first group taught four lessons on
transportation concepts (e.g., truck, sailboat); the second group taught four
lessons on community workers (e.g., doctor, mailman). The data on the two
groups were analyzed separately.


Figure 1


Teaching Situations for Measuring Three Types of Teacher
Consistency in Short-Term Studies


[Diagram: four teaching situations (Topic A to Group 1, Topic A to Group 2,
Topic B to Group 1, Topic C to Group 2) joined by lines labeled a, b, or c.]


a = the same material to different groups
b = different material to the same group
c = different material to different groups


Students in the area were not randomly assigned to centers; however, 
each teacher’s class was randomly divided into two learner groups for this 
study. The size of the groups ranged from six to nine students. Each of the 
two student groups in each class was taught two lessons by its teacher and 
was shown lessons on videotape. The raw performance scores of the learner 
groups were adjusted for learner group differences and for the difficulty of 
the particular lesson taught. The specific regression formula was developed 
by Fortune. 


In the second study (Fortune, Gage & Shutes, 1966), 30 teacher-trainees 
in high-school social studies taught four lessons on social studies topics. The 
students from various local schools, who were participating in a summer 
microteaching clinic, were randomly divided into “classes” of five students. 
The raw performance scores of the learner groups were adjusted by regres- 
sion for learner group differences and for the difficulty of the specific lesson 
taught. In contrast to the first study, there were no videotaped lessons 
viewed by all students. 

In the third study (Fortune, 1967), the students were all the fourth-, 
fifth-, and sixth-grade students at a university-related school. Intact classes 
were randomly divided into halves for instruction; the classes had not been 
formed at random originally, but no mention was made in the report of 
any bias in assignment. There were 15 student-teachers in English, 14 in 
mathematics, and 13 in social studies; each taught four lessons, on three 
topics, to two groups of students. The design for instruction followed that 
outlined in Figure 1. The class scores were adjusted using the same regression
formula used in the previous studies, and separate results were calculated
for the student-teachers in each subject area. In the other two short-term 
studies (Belgard, Rosenshine & Gage, 1968; Justiz, 1969), only one of the 
three situations for measuring teacher consistency was studied; these two 
studies will be described as the results are discussed below. 

One advantage of the design of all five short-term studies is that by 
providing teachers with specific instructional material, and by developing 
criterion tests based on the material, one could expect to obtain greater 
congruence between the teacher’s instructional objectives and those meas- 
ured by the test. The possibility of such congruence was further enhanced 
in two of the studies (Fortune et al., 1966; Belgard et al., 1968) by providing 
the teachers with 5 of the 10 posttest questions a few days before the 
instructional period. (In neither study was there any significant difference 
in student achievement between the 5 questions the teachers had been given 
and the 5 they had not been given.) 


Same Topics to Different Students 


Of the three designs for measuring teacher consistency, the one re- 
quiring teachers to teach the same topic to different students appears most 
similar to the five long-term studies reported above (see Table 1). In the 
three studies designed by Fortune, there was one instance in which teachers 
taught the same materials to different students. In two of the studies 
(Fortune, 1966, 1967), the students were random halves of the same class, 
and in the third study the students were stratified by grade level and then 
randomly assigned to classes. The stability coefficients (see Table 2) were 
quite consistent. In five of the six samples, the correlations ranged from .45 
to .70, and four correlations were significant at the .05 level or better. 

In the long-term studies (see Table 1) in which teachers also taught 
the same topics to different students, only 2 of the 12 correlations (see Harris
et al., 1968, grade 1) were above .40. The discrepancy between the stability
coefficients obtained in the long-term studies and those obtained in these
short-term studies is striking.


Different Topics to the Same Students 




The three studies by Fortune and his associates were designed so that 
each teacher had two opportunities to teach different material to the same 
students (see Figure 1). One “class” was taught Topics A and B, and the




ROSENSHINE TEACHER EFFECTS UPON STUDENT ACHIEVEMENT 


Table 2


Consistency of Results in Short-Term Studies When Teachers Taught
the Same Topic to Different Groups of Students


Investigator             Grade level   Subject          N
Fortune (1966)           Headstart     Social Studies
                         Headstart     Social Studies
Fortune (1967)           4-6           English          15
                         4-6           Math
                         4-6           Social Studies
Fortune et al. (1966)    7-9           Social Studies   40


second “class” was taught Topics A and C. This procedure yielded two 
stability coefficients for each sample of teachers (see Table 3). 

In the fourth short-term study (Belgard et al., 1968), experienced 12th- 
grade social studies teachers taught two 15-minute lessons on successive days 
to their regular classes. Random assignment of students to classes was im- 
possible; the 43 participating teachers and their classes were located in 
schools dispersed throughout the San Francisco Bay area. On the third day, 
all students heard an identical audiotape lecture on a third topic, and 
class mean scores on the test following this lecture were used to adjust 
posttest scores on the other two lectures for differences in the ability of the 
classes to learn from lectures. The lecture material and criterion tests were
selected from the same materials which had been used in the study by 
Fortune et al. (1966), and teachers were provided with 5 of the 10 criterion 
questions. 

The stability coefficients for seven samples of teachers in the four 
studies are presented on Table 3. Compared to the consistent and usually 
significant correlations obtained when teachers taught the same material to 
different classes (see Table 2), the correlations in Table 3 are quite different. 
One would expect that, when different topics are taught to the same stu- 
dents, the correlations would be lower than when the same topics are 
taught to different students since the instructional materials and posttests 
would be different and the teachers could be using different teaching
methods in the two situations. The correlation coefficients were lower in
almost all cases, although I was not prepared for the negative correlations
reported by Fortune (1967). The exception on Table 3 is the moderately
high correlation (r = .47, p < .001) obtained when experienced teachers
lectured on different topics to their regular classes (Belgard et al., 1968). 
The sample studied by Belgard et al. consisted of experienced teachers who 
volunteered for the study and were restricted to lecturing during both 
lessons. Unfortunately, no comparable data on experienced teachers are 
available in the other short-term situations. There is a need for further 
studies employing experienced teachers and restricted teaching methods. 


Table 3 


Consistency of Results in Short-Term Studies When Teachers Taught 
Different Topics to the Same Group of Students 


Social Studies
Social Studies

*p < .10
p < .001


Different Topics to Different Students


In the studies by Fortune and his associates, the design permitted
three correlations to be computed for each sample of teachers
who taught different topics to different groups of students (see Figure
1). The three correlation coefficients for each of the six samples of teachers
are presented on Table 4. These correlations are the most perplexing of all;
they tend to be higher in both directions than those presented on the other
tables. In comparison to the correlations on the other tables, one would
expect those on Table 4 to be consistently lowest because in this situation
the instructional materials, posttests, and students are different, and
the teachers could be using different teaching methods in the two
situations. Yet, such
this saad sioe Bac hes ei ears of -.45, -.42, and .42 within 


In contrast to the studies by Fortune, the results obtained by Justiz
(1969) across two samples showed amazing consistency and statistical
significance. Each of the student-teachers taught two 30-minute lessons on
subjects not currently taught in senior-high schools. Students were randomly
assigned to newly constituted classes (12 to 18 students in a class), teachers 
were randomly assigned to classes, and each teacher taught the two lessons 
to different classes. Class mean posttest scores for each teacher were used 
to compute the Spearman rank order correlations. Separate coefficients were 
reported for each of the two schools used in the study. 
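A Spearman rank-order correlation of the kind reported by Justiz can be sketched as the product-moment correlation of the ranks. The helper names and the class mean scores below are hypothetical; they are not data from the study.

```python
def ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank-order correlation: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical class mean posttest scores for five teachers on two lessons.
lesson_1 = [7.2, 5.1, 6.4, 8.0, 4.9]
lesson_2 = [6.8, 6.1, 6.0, 7.9, 5.0]
rho = spearman(lesson_1, lesson_2)  # 0.9: only two adjacent ranks are exchanged
```

Because only the ordering of teachers enters the coefficient, it is unaffected by differences in the difficulty or scale of the two posttests.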


Table 4 


Consistency of Results in Short-Term Studies When Teachers Taught 
Different Topics to Different Groups of Students 


Investigator             Grade level   Subject          N    Correlations
Fortune (1966)           Headstart     Social Studies   15   2i, .42, AS
                         Headstart     Social Studies   15   -.08, .13, 5
                                                             42
Fortune (1967)a          4-6           English          15   +35. Dt, ss
                         4-6           Math             14   -.29, .05, .82
Fortune et al. (1966)    7-9           Social Studies   40   -.10, .01, .43
Justiz (1969)            Senior High   Varied           10   .638 (.05)
                         Senior High                         .908 (.01)


aRank order correlation using posttest scores only. All other correlations are
product-moment using adjusted posttest scores.


Summary and Suggestions for Future Research 


The available studies of teacher consistency can be classified into four 
types: (a) long-term studies in which teachers taught the same material to 
different students, (b) short-term studies in which teachers taught the same 
material to different students, (c) short-term studies in which teachers 
taught different material to the same students, and (d) short-term studies 
in which teachers taught different material to different students. Positive 
and consistent correlations were obtained in both (a) and (b), but the 
size of the correlations and the percentage of significant results were highest 
in (b). 

The consistent results in (b) must be considered carefully because all 
these studies were conducted by Fortune et al., and the same materials,
criterion tests, students, and methods of adjusting posttest scores which
were used in (b) were also used in deriving the stability coefficients in the 
other two short-term settings—(c) and (d); yet the results were quite 
different. When teachers taught the same topic to different students, the 
stability coefficients were all positive (r’s=.22 to .70), and four of the six 


AÙ), and none was significant. When teachers taught different topics to 
different students, the direction of the correlations was again erratic (r’s = 
AS to .82), and few correlations were significant.

The overall conclusion from the three studies by Fortune is that there
was stability of teacher effects in only one of the three situations. One
variable which appears to distinguish this situation from the other two
situations is control over variation in teacher behavior. Although Fortune
did not present data on the stability of teacher behavior in the various
situations, it is possible that the least amount of variation occurred when
teachers taught the same topic twice. In the other situations, the
teacher-trainees taught different topics, and because they were allowed to
instruct in any manner they wished, there may have been greater variation
of behavior in those situations. Similar control over teacher behavior was
maintained in the study by Belgard et al., in which teachers were limited to
lecturing, even though they taught two different topics. In this study, the
stability coefficient was


It is difficult to assess the degree to which instructional materials
related to posttests contribute to the consistency of teacher
effectiveness. This variable appears to distinguish the short-term studies in
which teachers taught the same materials to different students from the
long-term studies of similar design. However, instructional materials related to
posttests were also used in the other short-term studies, and the results 
were erratic in those other two situations. Unfortunately, only teacher- 
trainees were used as subjects in almost all of the short-term studies, and we 


opportunity to learn each of 
or by coding the items on the posttest (see Husén, 1967),
posttest item (see Rosenshine, 1968; Shutes, 1969). A measure of the content
relevance of the lesson might be used as a covariate or as a stratifying
variable, or it could be used to deselect from the sample those teachers whose
teaching was not equally relevant across two occasions.


The current long-term studies show that one cannot use the residual
achievement gain scores in one year to predict the gain scores in a successive
year with any confidence. But only a few such studies exist, and not all of
these have equal generalizability to natural classroom teaching. On the
basis of the short-term studies, one might suggest that stability coefficients
can be increased if there is greater congruence between instruction in the
classroom and the items on the posttests, and less variation in the behavior
of the teacher across successive instructional periods. A goal of future
research might be to determine whether attention to these variables will
yield reasonable stability coefficients in long-term studies.


It is currently possible to conduct numerous long-term studies with
samples of 50 to 200 teachers. Such studies can be conducted with data
from Title I or Title III projects which contain the same teachers
for two or more years; such studies can also be conducted in school
districts where students are tested annually in a district-wide testing
program. When several teachers work together in the same project, estimates
of the consistency of the project effectiveness can be obtained. In such
studies the investigator need only work with data which schools have already
collected and will collect in the future.

In future studies which utilize these data, the statistical procedures need
not be limited to a replication of the existing studies. A refined procedure
for computing residual gain scores is currently available (Cronbach & Furby,
1970). More complex designs can be used in which classes are blocked by

school, by pretest level, or by other characteristics. Future investigations 
need not be limited to the study of teacher effects across successive years; 
it is possible to use the data from several years when they are available. Nor 
should future studies be limited to establishing whether teachers are con- 
sistent compared to other teachers in the sample; investigators can also 
determine whether teachers are consistent in bringing their classes to 
specified levels of competence. 

If more studies of teacher consistency are conducted, we may obtain
clearer information on the variables which influence teacher consistency
and on the stability coefficients which might be expected in different
situations. At present there appears to be a significant lack of knowledge in
an area of concern.


Belgard, M., Rosenshine, B., 
Evidence on its general


DIFFERENTIAL WEIGHTING: A REVIEW
OF METHODS AND EMPIRICAL STUDIES¹


Marilyn W. Wang²


Julian C. Stanley 


The Johns Hopkins University 


When measures are to be combined to form a composite
measure or to predict a criterion, the question of differential weighting of
the component measures arises. Can differential weighting improve the
reliability of the composite or provide a more valid composite than would
be obtained if the component measures were merely summed or averaged?
Theoretically the answer to this question should be “yes.” It is unlikely
that the component measures will be equally reliable, have equal variances,
be equally intercorrelated with one another, and be equally correlated with
the underlying variable which the composite is to measure or with the
external criterion which is to be predicted. However, since each of these
characteristics of the component measures will be reflected in the composite
measure, it is to be expected, on purely logical grounds, that differential
weighting would be effective.


When a criterion measure is available, multiple-regression techniques 
provide a set of weights optimal for minimizing the error of prediction for 
the group on which the weights were derived, under the usual assumptions 
of normality and linearity of regression. Or, alternatively, the weights may 
be chosen so as to maximize certain internal criteria such as the reliability 
of the composite measure. All methods weight most heavily those component 
measures which are “best” according to the particular criterion adopted, and 
they weight least, perhaps negatively, those which are “worst.” 

McDonald (1968) offered a “unified treatment of the weighting
problem,” a general procedure for obtaining weighted linear combinations of
variables. The procedure includes regression, canonical variate analysis,
principal components, maximum reliability, canonical factor analysis,
and some other well-known methods. McDonald’s procedure yields certain
desirable invariance properties across transformations of the variables.
Although the approach is not discussed further here, it is applicable to a
considerable part of this survey since it simplifies some of the seemingly
diverse procedures of the past half-century.

¹Preparation of this review was supported by a grant from the College Entrance
Examination Board.

²Now at the University of Pittsburgh School of Medicine.

Although differential weighting theoretically promises to provide
substantial gains in predictive or construct validity, in practice these gains
are often so slight that they do not seem to justify the labor involved in
deriving the weights and scoring with them. This is especially true when the
component measures are test items and much less true in other circumstances.
This has led psychologists to conclude that weighting, especially item
weighting, is not worth the trouble (Guilford, 1954; Gulliksen, 1950).

Item weighting is not the only type of weighting which has been
investigated. In most interest and personality tests, some form of option
weighting occurs, i.e., the subject’s score on an item depends on which
option or response category he selects or prefers. Often there are many sets
of weights which are applied successively to the answer sheet to derive a
score for the subject on a number of different scales. Although it has not
been studied extensively in the past, differential weighting of options on
aptitude or achievement tests is also possible. In fact, it was proposed (de
Finetti, 1965; Shuford, Albert & Massengill, 1966) that the reliability and
validity of tests may be increased if the subject himself assigns weights to
the options according to his confidence in the correctness of each option.


This paper is devoted to a systematic study of differential weighting.
In the first section different types of weighting and methods of deriving
weights are discussed, and the mathematical restrictions which limit the
effectiveness of many sets of weights are pointed out. The second section
contains a summary of empirical investigations of weighting in each of the
typical situations where weighting has been considered potentially useful.
In the final section consideration is given to response-determined scoring,
a term which implies that a range of scores is possible for a given test item
or that it is the subject’s response pattern which determines his score.
Confidence weighting methods are an example of this.


Weighting and the Derivation of Weights 


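It is customary to define the weight of a single variable in a composite
as the contribution of that variable to the variance of the composite. If there
are n variables in the composite, the variance of the composite may be
expressed by the formula for the variance of a sum:

\[
\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + \cdots + \operatorname{Var}(X_n)
 + 2\operatorname{Cov}(X_1, X_2) + 2\operatorname{Cov}(X_1, X_3) + \cdots + 2\operatorname{Cov}(X_{n-1}, X_n). \tag{1}
\]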


Equation 1 indicates that the variance of the composite is equal to the sum
of the variance and covariance terms. If all n² terms are arranged in a
symmetrical matrix of order n, the contribution of the ith variable to the
variance of the composite, $C_i$, is given by the sum of the terms in
the ith row (or the ith column) of the matrix:

\[
C_i = \operatorname{Cov}(X_i, X_1) + \operatorname{Cov}(X_i, X_2) + \cdots + \operatorname{Cov}(X_i, X_{i-1}) + \operatorname{Var}(X_i)
 + \operatorname{Cov}(X_i, X_{i+1}) + \cdots + \operatorname{Cov}(X_i, X_n). \tag{2}
\]


The natural effective weight of the ith variable is thus given by its own vari- 
ance plus its covariance with each of the remaining variables. The variables 
in the composite are equally weighted only if they make equal contributions 
to the total variance, i.e., if the sum of the elements in each row (column) 
of the variance-covariance matrix is equal to a constant. 
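Equation 2 can be illustrated numerically. In the following sketch the scores are hypothetical and the function names are assumptions introduced for illustration; population (divide-by-N) covariances are used throughout.

```python
def covariance(x, y):
    """Population covariance of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def natural_effective_weights(variables):
    """Row sums of the variance-covariance matrix (Equation 2)."""
    return [sum(covariance(xi, xj) for xj in variables) for xi in variables]


# Hypothetical scores of five examinees on three component measures.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]
x3 = [5.0, 5.0, 6.0, 7.0, 7.0]
components = [x1, x2, x3]

weights = natural_effective_weights(components)      # about 13.6, 6.2, 4.2
composite = [sum(scores) for scores in zip(*components)]
composite_variance = covariance(composite, composite)  # about 24.0
```

The three contributions sum to the variance of the simple sum, and the measure with the largest variance dominates the composite even though no nominal weights were applied.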

This definition of weighting implies that the composite measure for a 
single individual has little significance in and of itself and that its meaning 
depends on the distribution of composite measures for all individuals. This 
is probably not an unreasonable assumption, since so much measurement 
in educational and psychological research is based on ordinal or interval 
scales and since population norms are required before an individual's score 
may be interpreted. Occasionally, however, individuals’ scores do have 
sufficient intrinsic or arbitrarily agreed upon meaning to be interpreted 
without reference to a distribution of scores. When this is true, the above 
definition of weighting is not appropriate. 

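Equation 3 gives the variance of composite scores when each of the
component measures is multiplied by a nominal weight, $w_i$, before being
summed:

\[
\operatorname{Var}(w_1 X_1 + w_2 X_2 + \cdots + w_n X_n) = w_1^2 \operatorname{Var}(X_1) + w_2^2 \operatorname{Var}(X_2) + \cdots + w_n^2 \operatorname{Var}(X_n)
 + 2 w_1 w_2 \operatorname{Cov}(X_1, X_2) + 2 w_1 w_3 \operatorname{Cov}(X_1, X_3) + \cdots + 2 w_{n-1} w_n \operatorname{Cov}(X_{n-1}, X_n). \tag{3}
\]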


The contribution of the ith variable to the variance of the composite thus 
becomes 

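\[
C_i = w_i w_1 \operatorname{Cov}(X_i, X_1) + w_i w_2 \operatorname{Cov}(X_i, X_2) + \cdots + w_i w_{i-1} \operatorname{Cov}(X_i, X_{i-1}) + w_i^2 \operatorname{Var}(X_i)
 + w_i w_{i+1} \operatorname{Cov}(X_i, X_{i+1}) + \cdots + w_i w_n \operatorname{Cov}(X_i, X_n). \tag{4}
\]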

Whereas the $w_i$’s constitute the nominal weights assigned to the variables,
the effective weight of each variable is determined by its contribution to the
variance of the composite. It is a common misconception that the nominal
weights correspond to the relative weights of the variables in the composite.
Although the nominal weights do influence the effective weights, as may
be seen in Equation 4, they are not generally proportional to them.

Equation 4 may also be expressed in terms of the intercorrelations of
the variables. Since $\operatorname{Cov}(X_i, X_j) = \rho_{ij}\sigma_i\sigma_j$, the formula becomes

\[
C_i = w_i w_1 \rho_{i1} \sigma_i \sigma_1 + w_i w_2 \rho_{i2} \sigma_i \sigma_2 + \cdots + w_i w_{i-1} \rho_{i,i-1} \sigma_i \sigma_{i-1} + w_i^2 \sigma_i^2
 + w_i w_{i+1} \rho_{i,i+1} \sigma_i \sigma_{i+1} + \cdots + w_i w_n \rho_{in} \sigma_i \sigma_n. \tag{5}
\]


From this it is clear that the contribution of the ith variable to the total
variance depends on (a) the nominal weights, (b) the variance of $X_i$,
(c) the n - 1 correlations between $X_i$ and the other variables in the composite,
and (d) the standard deviations of the other variables in the composite.
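The distinction between nominal and effective weights can be shown with a small numerical sketch of Equation 4. The two score lists and the nominal weights are hypothetical; `effective_weights` is a name introduced here for illustration.

```python
def covariance(x, y):
    """Population covariance of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def effective_weights(variables, nominal):
    """Contribution of each weighted variable to the composite variance
    (Equation 4)."""
    return [
        sum(wi * wj * covariance(xi, xj) for xj, wj in zip(variables, nominal))
        for xi, wi in zip(variables, nominal)
    ]


# Hypothetical scores: x1 has a variance of 8, x2 a variance of 2,
# and their covariance is 3.2.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]
nominal = [1.0, 2.0]

effective = effective_weights([x1, x2], nominal)  # both about 14.4
nominal_ratio = nominal[1] / nominal[0]           # 2.0
effective_ratio = effective[1] / effective[0]     # about 1.0
```

With these particular values, doubling the nominal weight of the less variable measure merely brings its contribution level with that of the other measure; the nominal ratio of 2 to 1 corresponds to an effective ratio of about 1 to 1.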

If each of the n variables is expressed in standard form, $z_i = (X_i - \mu_i)/\sigma_i$,
or (less restrictively) if each $X_i$ is divided by $\sigma_i$, then all variances
and standard deviations of the resulting scores will be equal to unity and
will disappear from the formula, giving

\[
C_i = w_i w_1 \rho_{i1} + w_i w_2 \rho_{i2} + \cdots + w_i w_{i-1} \rho_{i,i-1} + w_i^2
 + w_i w_{i+1} \rho_{i,i+1} + \cdots + w_i w_n \rho_{in}. \tag{6}
\]

Thus, when standard scores are weighted, the effective weight of the ith
variable is determined by the nominal weights and the intercorrelations of
the variables. If unit weights are assigned to variables with unit variances,
the effective weight of a variable is approximately proportional to its
average correlation with the other variables:

\[
C_i = 1 + \sum_{j=1}^{n} \rho_{ij} = 1 + (n-1)\bar{\rho}_i, \tag{7}
\]

where $i \neq j$, and where $\bar{\rho}_i$ is the mean of the n - 1 correlations between
the variable and the other variables in the composite. $\sum_j \rho_{ij}$ is equivalent
to the covariance of the ith variable with the total score on the remaining
variables, $\operatorname{Cov}(X_i, \sum_j X_j)$, where $i \neq j$.
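A brief numerical sketch of Equation 7 follows. The correlation matrix is hypothetical, and `unit_weight_contributions` is a name assumed for illustration.

```python
# Hypothetical correlation matrix for four standardized variables.
R = [
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.3, 0.2],
    [0.5, 0.3, 1.0, 0.1],
    [0.4, 0.2, 0.1, 1.0],
]


def unit_weight_contributions(corr):
    """C_i = 1 + sum of the correlations of variable i with the others
    (Equation 7, unit nominal weights and unit variances)."""
    n = len(corr)
    return [1.0 + sum(corr[i][j] for j in range(n) if j != i) for i in range(n)]


contributions = unit_weight_contributions(R)  # about 2.5, 2.1, 1.9, 1.7
mean_correlations = [(c - 1.0) / (len(R) - 1) for c in contributions]
total_variance = sum(contributions)           # about 8.2, the sum of all entries of R
```

Even with unit weights and standard scores, the first variable contributes most to the composite because it has the highest average correlation with the others.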


From the foregoing discussion it should be clear that nominal weights 
will not in general be proportional to effective weights. Also, variables will 
rarely have equal effective weights unless the nominal weights have been 
derived specifically to ensure this result (e.g., see Kaiser, 1967, for a way
to make all $\operatorname{Cov}(X_i, X_j)$ zero). Otherwise, using unit nominal weights with
standard scores probably comes closest to achieving equal effective weighting,
particularly if the average correlation of each variable with the others is
nearly constant.

There is one situation, however, in which the nominal weights are
always directly proportional to the effective weights and in which the 
customary definition of effective weighting is not appropriate. Assume that 
a teacher decides to assign letter grades to her class on the basis of a semester 
score which is expressed as a percentage. The following scheme 
might be adopted: A=90%-100%; B=80%-89%; C=70%-79%;
D=65%-69%; F=below 65%. Five examinations are given during the
semester and the score on each is expressed as the percentage of items 
answered correctly. The semester score is the average of these examination 
scores. Here it is appropriate to say that the five examinations have been 
equally weighted in determining the final grade, regardless of the distribu- 
tion of scores on any of the examinations or the intercorrelations of the 
examinations. Likewise, if a weighted average of the examinations were to
be taken, the effective weight of each would be directly proportional to the 
nominal weight assigned to it. This is true because the semester score of 
each pupil may be interpreted without reference to the distribution of semes- 
ter scores. Moreover, since the several examinations contribute to this score 
in direct proportion to the weights assigned to them, these nominal weights 
are also the effective weights. 

This point can be seen even more clearly at the item level. Suppose 
that the teacher administers a five-question test and assigns 35 percentage 
points to one of the questions. No pupil who fails the question earns a grade
of C or better. If all pupils fail the question, they all earn grades of D or 
F, even though scores on the question have zero variance and covary zero 
with scores on each other question. Absolute grading on an arbitrary scale 
differs in this way from grading each pupil relative to the performance of 
the others in his class. 
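The arithmetic of the item-level example can be sketched as follows. The grading scale is the one given above, with the C range assumed to be 70% to 79%; the division of the remaining 65 percentage points among the other four questions is an assumption made only for illustration.

```python
def letter_grade(percent):
    """Absolute grading scale: A 90-100, B 80-89, C 70-79, D 65-69, F below 65."""
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 65:
        return "D"
    return "F"


# Percentage points for a five-question test; the first question carries 35.
# The split of the other 65 points is hypothetical.
points = [35, 20, 15, 15, 15]


def score_for(passed):
    """Total percentage earned, given which questions were answered correctly."""
    return sum(p for p, ok in zip(points, passed) if ok)


best_without_first = score_for([False, True, True, True, True])  # 65
grade = letter_grade(best_without_first)                          # "D"
```

A pupil who fails the 35-point question can earn at most 65 percentage points, hence no grade higher than D, whatever the variance of scores on that question or its covariance with the other questions.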

This situation is contrasted with one in which the examinee’s semester 
score is interpreted with reference to the distribution of semester scores. A 
familiar example is the use of standard scores or percentiles. An examinee’s
letter grade then depends on his score and the variance of the semester-
score distribution. It is for this reason that the effective weights of the
examinations are assessed in terms of their contributions to this variance.


Methods of Weighting Variables


There are a great many methods of weighting which have been and
continue to be used in educational research. In each method discussed in
this section, the entity to be weighted is a quantitative variable, usually
a test score or a test-item score. Most often the weights derived are the
nominal weights for raw scores. However, many derivations
are considerably simplified when the measures are assumed to have unit
variances. When this is true, it is implicit that the nominal weights may
be redefined to absorb the standard deviations of the variables.

Natural weights. When n raw scores are simply summed or averaged
to form a composite measure, the effective weight of each variable is deter-
mined by Equation 2. Since no effort is made to control the effective weights,
they are called natural or random weights. The term “random” does not
imply that differences between the weights occur by chance. Real differences 
in the variances of the variables and differences in their intercorrelations 
are simply allowed free rein in determining the weights. It should be noted 
carefully that natural weighting corresponds to what is usually considered 
no weighting. The measures are unweighted only in the sense that the 
nominal weights are unity. 


A priori weights. When nominal weights are assigned on the basis of 
judgments or ratings or by some similar procedure, the weights are termed 
a priori weights. The decision not to weight, i.e., to assign unit nominal 
weights, is a special case of a priori weighting. 


A priori weighting frequently occurs when different sections or ques- 
tions on an examination are weighted differentially. For example, 20 true- 
false questions on a test might be allowed one point each, whereas 20 
multiple-choice questions on the same test might be worth four points each. 
Items of the same type may also be differentially weighted on an a priori 
basis. Corey (1930) had instructors rate each item of an objective test in 
psychology on a 7-point scale according to its judged importance for a 
general knowledge of psychology. The rating then became the weight for 
the item. Weighting on an a priori basis is also very common in personnel
selection, where certain job criteria may be deemed more important than
others.


Although there are important empirical methods available for deriving 
nominal weights, such methods are not necessarily preferable in all situ- 
ations. Burt (1950, p. 122) concluded that a priori or subjective weighting 
may be necessary when questions of value are concerned or when the 
resulting measure is genuinely a composite. A priori weighting is most 


appropriate when it is actually used to define the nature of the composite 
measure. 


The empirical method for deriving weights which is probably most 
familiar to psychologists is multiple linear regression, although it is only 
one of several least-squares solutions which have been used to derive 
weights. These and other methods are discussed below and their major 
advantages are noted. However, since the actual mathematical derivations 
of the weights are readily available elsewhere, they are not presented here. 


Multiple regression. If a criterion measure, $X_0$, is available, so that the
$X_i$’s which form the composite may serve as predictor variables, the
classical multiple regression equation gives predictor weights which
maximize the correlation between the composite score and the actual criterion
score. This solution also minimizes the mean squared error of prediction,
given that the regression of the criterion on the predictors is linear.

The general form of the equation when all variables are expressed in 
standard-score form is 


\[
\hat{z}_0 = \beta_{01.23 \ldots n}\, z_1 + \beta_{02.13 \ldots n}\, z_2 + \cdots + \beta_{0n.12 \ldots (n-1)}\, z_n, \tag{8}
\]

where $\hat{z}_0$ is the estimated criterion score for an examinee. The $\beta$-weights
in Equation 8 are the nominal weights for standard scores. The $\beta$-weight
of the ith predictor is the partial regression coefficient of the criterion on
that predictor, i.e., it is the regression of that part of the criterion which is
independent of all the other $n - 1$ variables on that part of $z_i$ which is
also independent of them (Kelley, 1923).

If each $z_i$ in Equation 8 is replaced by its equivalent, $(X_i - \mu_i)/\sigma_i$,
and if the equation is then solved explicitly for $X_0$, then a new set of
weights may be defined by the relation $b = \beta(\sigma_0/\sigma_i)$, where the subscripts
on $b$ and $\beta$ have been dropped for convenience. The b-weights are the
nominal weights for either deviation-from-the-mean predictor scores or for
raw scores. In the latter case, however, the regression equation must include
an additive constant which incorporates both $\mu_0$ and the b-weighted $\mu_i$'s.
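
The relations just described can be illustrated numerically. The following is a minimal sketch, not a procedure taken from the works cited: it assumes that the predictor intercorrelations and the predictor-criterion correlations are already known, solves the normal equations for the $\beta$-weights, and converts them to b-weights. The function names and the example values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def beta_weights(R, r_crit):
    """Standard-score weights: solve R * beta = r_crit."""
    return solve(R, r_crit)


def raw_score_weights(betas, sd_pred, sd_crit):
    """b = beta * (sigma_0 / sigma_i) for each predictor."""
    return [bt * sd_crit / s for bt, s in zip(betas, sd_pred)]


def additive_constant(b, mean_pred, mean_crit):
    """Constant needed when the b-weights are applied to raw scores."""
    return mean_crit - sum(bi * m for bi, m in zip(b, mean_pred))


# Hypothetical two-predictor example.
R = [[1.0, 0.3], [0.3, 1.0]]   # predictor intercorrelations
r_crit = [0.6, 0.5]            # correlations with the criterion
betas = beta_weights(R, r_crit)
b = raw_score_weights(betas, sd_pred=[10.0, 5.0], sd_crit=2.0)
a = additive_constant(b, mean_pred=[50.0, 20.0], mean_crit=3.0)
multiple_r_squared = sum(bt * r for bt, r in zip(betas, r_crit))
```

In this example the first predictor receives the larger $\beta$-weight because it correlates more highly with the criterion, in line with the first property noted in the text.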

Two properties of the regression weights, other things being equal,
are noteworthy. First, the larger the correlation is between the variable and
the criterion, the larger is the weight. Second, the more independent is the
variable of the $n - 1$ other variables, the larger is the weight.

In Equation 8 it was assumed that the criterion was a single variable. 
However, Ryans (1954) considered the problem of weighting from the other 
side of the fence, i.e., weighting the components of a criterion to arrive
at a suitable measure. Hotelling (1935) presented a method called canonical 
correlation for assigning weights to two batteries (one of which might 
serve as a criterion) which would maximize the correlation between them. 
If there are several criterion variables, it may be desirable in some
applications to derive weights that produce a composite with a specified
intercorrelation matrix (Green, 1970).

Multiple regression requires considerable caution in interpretation.
Although predictor weights are those which maximize the multiple
correlation, R, within the sample on which they were derived, most often the
weights are derived in one sample and then used to predict the criterion 
either in the population or in another sample. Because of sampling fluc- 
tuations, the weights capitalize on idiosyncrasies of the derivation sample 
and the obtained multiple R is spuriously high. This is especially true if the 
number of predictors is high relative to the number of cases in the sample. 
in the predictors contributes to this indirectly by 
increasing sampling fluctuations. Thus the sample multiple R is not an 
unbiased estimator of either the population multiple R or the multiple R
which would be obtained in a crossvalidation sample. The larger the number
of predictors is, the more unstable are the sample regression weights and
the more shrinkage the multiple R is likely to show. Wherry (1931) and
Lord (1950) presented formulas for estimating the population and
crossvalidation multiple R, respectively, and Uhl and Eisenberg (1970) verified
empirically, over different sample sizes and numbers of predictors, that
Lord's formula was superior for predicting multiple R in a crossvalidation
sample. A comprehensive treatment of the parameters of crossvalidation was
given by Herzberg (1969).
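
The shrinkage formulas are cited here but not reproduced. The sketch below uses the forms that are often attributed to these authors in later methodological writing; that attribution, the function names, and the example values are assumptions, and the original papers should be consulted before the formulas are relied upon. In both functions N is the number of cases and n the number of predictors.

```python
def shrunken_r2_population(r2, n_cases, n_pred):
    """Adjusted squared multiple R, in a form often attributed to Wherry.

    ASSUMPTION: 1 - (1 - R^2)(N - 1)/(N - n - 1); some sources give a
    slightly different denominator for Wherry's original formula.
    """
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_pred - 1)


def shrunken_r2_crossvalidation(r2, n_cases, n_pred):
    """Estimated crossvalidation squared R, in a form often attributed to Lord.

    ASSUMPTION: 1 - (1 - R^2)(N + n + 1)/(N - n - 1).
    """
    return 1.0 - (1.0 - r2) * (n_cases + n_pred + 1) / (n_cases - n_pred - 1)


# Hypothetical case: sample R^2 of .50 from 50 cases and 5 predictors.
pop = shrunken_r2_population(0.50, 50, 5)
cross = shrunken_r2_crossvalidation(0.50, 50, 5)
# Shrinkage grows as the number of predictors rises relative to the cases.
cross_many = shrunken_r2_crossvalidation(0.50, 50, 15)
```

Under these assumed forms the crossvalidation estimate is always the smaller of the two, which agrees with the expectation that weights applied to a new sample lose more than the population estimate suggests.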

Wolins (1967) considered the problem of bias in the regression weights 
when the predictors are fallible measures. Cureton (1951) also discussed 
this. As the correlation between two fallible variables approaches the limit 
set by their respective reliabilities, the differences between the variables 
increasingly reflect error variance rather than true-score variance. As the 
intercorrelation between the variables drops, proportionally more of the 
difference between the variables is due to true differences rather than error, 
and the bias in the regression weights thus drops. The estimate of the 
multiple correlation squared, however, is only slightly affected by bias in the 
regression weights since, as the correlation rises, multiple R is less and less 
dependent on the actual values of the regression coefficients. 

In terms of efficiency, multiple-regression techniques are most useful 
when there are only a few predictor variables and, as the number of pre- 
dictor variables rises, when the predictors are relatively independent. (Sup- 
pressor variables constitute an exception, since they increase the multiple R 
as their correlation with the other predictors rises.) 

Equal contributions to total variance. It is sometimes desirable to ensure 
that the variables in a composite have equal effective weights. This might 
occur if the composite is truly intended to be a composite rather than a 
measure of some hypothesized underlying unitary entity, as when each of 
n judges assigns ratings and it is desired that the judges' opinions have
equal weight. In the absence of an external criterion, equal effective weights
may be deemed appropriate.

A special case of this method ... are given unit (or a priori) weights.
Kaiser (1967) presented an interesting orthogonalizing procedure.

Equal correlations with the composite. When no external criterion is
available, weights may be derived by the method of least squares to equalize
the correlation of each variable with the resulting composite score (Wilks,
1938). The correlation of $z_i$ with $w_1 z_1 + w_2 z_2 + \cdots + w_n z_n$ is given by the
equation
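\[
\rho_{z_i,\,\sum_j w_j z_j} = \frac{w_i + \sum_{j=1}^{n} w_j \rho_{ij}}{\sqrt{\sum_{i=1}^{n} w_i^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j \rho_{ij}}}, \quad \text{where } i \neq j. \tag{9}
\]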


Setting all such correlations equal to some arbitrary constant p and solving 
for the weights is equivalent to setting the numerators equal to p and 
solving for the weights, since the denominator is a constant. This method 
is logically defensible only if none of the variables are negatively correlated. 
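
Equation 9 is easy to evaluate directly. The sketch below is illustrative only: the function name and the correlation matrices are hypothetical, and the intercorrelation matrix is assumed to carry ones on its diagonal so that the term $w_i$ in the numerator is supplied by $\rho_{ii} = 1$.

```python
import math


def correlations_with_composite(w, R):
    """Correlation of each standardized variable with the weighted composite."""
    n = len(w)
    composite_var = sum(w[i] * w[j] * R[i][j] for i in range(n) for j in range(n))
    sd = math.sqrt(composite_var)
    return [sum(w[j] * R[i][j] for j in range(n)) / sd for i in range(n)]


# Equal intercorrelations: unit weights already equalize the correlations.
R_equal = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
equal_case = correlations_with_composite([1.0, 1.0, 1.0], R_equal)

# Unequal intercorrelations: unit weights no longer give equal correlations.
R_unequal = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
unequal_case = correlations_with_composite([1.0, 1.0, 1.0], R_unequal)
```

In the second case the variable with the largest total intercorrelation correlates most highly with the composite, which is the imbalance that the least-squares weights are meant to remove.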

Minimum generalized variance. Generalized variance represents an ex- 
tension of the concept of variance. Whereas the variance of scores on a 
single variable reflects the extent to which the scores are concentrated about 
a point, the mean generalized variance reflects the extent to which points 
in an n-dimensional space are concentrated within a space of n — 1 dimen- 
sions. Wilks (1938) applied the criterion of minimum generalized variance 
to the weighting problem as follows. An n-dimensional space is defined by 
the n variables which form the composite and, with all scores in standard 
form, the score of the pth person on the ith variable, $X_{ip}$, is represented as
a point in the space. A linear function of the $X_i$'s is sought such that for
any given value of the function, the generalized variance of individuals
having that value is minimized.

Minimum variation. In 1936 Edgerton and Kolbe presented a method 
for combining a number of measures of the same underlying variable based 
on the criterion that the sum of the squares of the n(n — 1)/2 differences 
between standard scores for an individual on each of the n variables be 
a minimum. In other words, intra-individual differences in standard scores 
were minimized. Horst (1936) suggested a method for deriving a set of
weights which would maximize the differences between composite scores for
all pairs of individuals, i.e., maximize inter-individual differences.
Interestingly, this approach leads to weights which are proportional to those
obtained with the former criterion. Edgerton and Kolbe, noting that the two
methods yielded equivalent results, maintained that their method was
computationally simpler, although with the advent of high-speed electronic
computers this distinction lost its importance.

Maximum reliability. In the absence of an external criterion, probably
no alternative weighting criterion is so frequently seized upon as maximum
reliability, especially if the variables in the composite are tests which
comprise a battery or the items of a single test. The maximum correlation which
may be obtained between two variables is limited by their respective
reliabilities: $\rho_{xy} = \rho_{t_x t_y}\sqrt{\rho_{xx'}\rho_{yy'}}$, where $\rho_{t_x t_y}$ is the correlation between true scores
on variables X and Y. Since reliability in measurement is a necessary,
though not sufficient, condition for a valid instrument, weighting for
maximum reliability has long been deemed a worthy enterprise.

When reliability is defined in terms of the proportion of total com- 
posite variance which is true-score variance (or one minus the proportion 
which is error variance), the reliability of the composite Y is given by the 
following formula when all variables are expressed in standard form: 


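\[
\rho_{yy'} = \frac{\sum_{i=1}^{n} w_i^2 \rho_{ii'} + \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j \rho_{ij}}{\sum_{i=1}^{n} w_i^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j \rho_{ij}}, \quad \text{where } i \neq j. \tag{10}
\]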


From this it is apparent that the reliability of the composite may equal 1.00
only if every $\rho_{ii'}$ also equals 1.00. Likewise, if every $\rho_{ii'}$ is zero, $\rho_{yy'}$ must
be zero because each $\rho_{ij}$ is also zero. Mosier (1943) discussed the effect
on composite validity of the interrelationships among the variables. If the
variables are mutually uncorrelated, the reliability of the composite is the
weighted mean of the item reliabilities $\rho_{ii'}$, where each is weighted by $w_i^2$.
He noted that this conclusion is of particular interest because when multiple 
regression is used for prediction, every attempt is made to obtain predictors 
that are independent or nearly so. It may be noted from Equation 10 that 
for a given set of individual reliabilities and weights, the reliability of the 
composite increases as the positive intercorrelation of the components
increases, although the unreliability of the components sets a limit
to the size of these correlations.
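
A direct evaluation of Equation 10 shows both points made above. This is a minimal sketch with hypothetical names and values; `rel` holds the component reliabilities $\rho_{ii'}$ and `R` the intercorrelations $\rho_{ij}$.

```python
def composite_reliability(w, R, rel):
    """Reliability of a weighted composite of standardized variables."""
    n = len(w)
    cross = sum(w[i] * w[j] * R[i][j]
                for i in range(n) for j in range(n) if i != j)
    true_part = sum(w[i] ** 2 * rel[i] for i in range(n)) + cross
    total = sum(w[i] ** 2 for i in range(n)) + cross
    return true_part / total


w = [1.0, 2.0]
rel = [0.8, 0.6]

# Uncorrelated components: the weighted mean of the reliabilities,
# each weighted by the square of its nominal weight.
uncorrelated = composite_reliability(w, [[1.0, 0.0], [0.0, 1.0]], rel)

# Positively correlated components: a higher composite reliability.
correlated = composite_reliability(w, [[1.0, 0.5], [0.5, 1.0]], rel)
```

With these values the uncorrelated case gives (0.8 + 4 × 0.6)/5 = .64, and raising the intercorrelation to .50 raises the composite reliability to 5.2/7, or about .74.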


Thomson (1940) derived formulas in matrix form for the maximum 
battery reliability and for the weights which give this reliability. Peel (1947, 
1948) showed that Thomson’s formulation could be considerably simplified. 
Peel (1948) also gave equations for the weights which maximize the correla- 
tion between a predictor battery and a complex criterion, which is itself a 
composite with fixed weights.

When the composite measure is a test score and the n variables are
test items, it is customary to evaluate single-form reliability of the test in
terms of internal consistency. Internal consistency reliability is estimated by
a number of well-known formulas, including the Kuder-Richardson formulas
and the formula for coefficient alpha. Kaiser and Caffrey (1965) and Bentler
(1968) presented factor analytic methods for maximizing internal
consistency. Kaiser and Caffrey's alpha-factor analysis maximizes the internal
consistency of common-factor scores, whereas Bentler's analysis maximizes
internal consistency of observed scores.


Validity vs. reliability. Since methods are available for computing the 
weights which give maximum validity and the weights which give maximum 
reliability, an interesting question is, “What is the effect on reliability of 
weighting for validity, and vice versa?” Since the two sets of weights are not 
at all likely to be proportional, weighting for one of these criteria will result 
in a less-than-maximal value of the other. 


The lack of correspondence between the two sets of weights is primarily 
attributable to two factors. First, when other factors are held constant, 
weighting for validity results in weighting the more valid variables more 
heavily while weighting for reliability results in weighting the more reliable 
variables more heavily. Unless the more reliable variables are also the more 
valid ones (with respect to the observed correlation with the criterion), the 
correlation between the two sets of weights will not be perfect or even 
nearly so. 

The second factor which affects the two sets of weights differently 
is the intercorrelation of the Xi’s. It was noted earlier that, when other 
factors are held constant, the variables which are more independent will 
receive higher multiple regression weights. When weights are derived 
to maximize reliability, however, high positive intercorrelation of the vari- 
ables provides internal consistency and stability and thus, other things 
being equal, the variables which are more highly correlated with the re- 
maining variables receive more weight. This may be seen readily from 
the formula which Mosier (1943) derived for assigning the weight to the 
pth item which would maximize composite reliability, when the qth item 
was taken as a reference and assigned a weight of 1.00: 


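\[
w_p = \frac{(1 - \rho_{qq'}) \sum_{i \neq p} w_i \rho_{ip}}{(1 - \rho_{pp'}) \sum_{i \neq q} w_i \rho_{iq} + \rho_{qq'} - \rho_{pp'}}, \quad \text{with } w_q = 1. \tag{11}
\]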


Note that the sum of the intercorrelations of the reference item appears 
in the denominator of all weights and that the sum of the intercorrelations 
of the pth item appears in the numerator. Of two items with the same 
reliability, the one with the higher total intercorrelation with the other 
items will have the higher weight. 
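
Weights of this kind can be found numerically. The sketch below does not reproduce Mosier's computing routine; it is an assumed alternative that repeatedly applies the stationarity condition for maximum composite reliability, $w_p (1 - \rho_{pp'}) \propto \sum_i w_i \rho_{ip}$ (with $\rho_{pp} = 1$ inside the sum), and rescales so that the reference item has a weight of 1.00. All names and values are hypothetical, and every reliability is assumed to be less than 1.00.

```python
def composite_reliability(w, R, rel):
    """Equation 10: reliability of a weighted composite."""
    n = len(w)
    cross = sum(w[i] * w[j] * R[i][j]
                for i in range(n) for j in range(n) if i != j)
    true_part = sum(w[i] ** 2 * rel[i] for i in range(n)) + cross
    return true_part / (sum(x ** 2 for x in w) + cross)


def max_reliability_weights(R, rel, ref=0, iterations=500):
    """Iterate w_p <- sum_i(w_i * R[i][p]) / (1 - rel[p]), rescaled so w[ref] = 1."""
    n = len(rel)
    w = [1.0] * n
    for _ in range(iterations):
        new = [sum(w[i] * R[i][p] for i in range(n)) / (1.0 - rel[p])
               for p in range(n)]
        w = [x / new[ref] for x in new]
    return w


R = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
rel = [0.8, 0.7, 0.9]
w_opt = max_reliability_weights(R, rel)
best = composite_reliability(w_opt, R, rel)
unit = composite_reliability([1.0, 1.0, 1.0], R, rel)
```

The resulting composite reliability is at least as high as that obtained with unit weights, and the converged weights can be checked against Equation 11 term by term.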

Since weighting for maximum composite reliability or internal consistency
may lower the validity of the composite measure, under what circumstances
is weighting for reliability recommended? The answer to this
question must be stated in the negative: Weighting for reliability should 
be avoided unless (a) there is no satisfactory external criterion available 
on which to base multiple regression weights, (b) it is not possible or 
feasible to manipulate reliability directly rather than statistically, and 
(c) it may be assumed that weighting for reliability will not significantly 
decrease the (possibly unknown) validity of the composite measure. 
There are at least two distinct ways in which condition (a) may be 
satisfied. The first occurs when, as we have so far assumed, each of the n 
variables is hypothesized to be a measure of a single underlying variable, 
and when there is no single measure of that variable against which to 
validate the composite measure. The second occurs when the composite 
measure is not assumed to be unidimensional, but rather, its dimension- 
ality is to be determined empirically. In this case a factor-analytic model 
of the structure underlying observed scores is appropriate, and a number 
of alternative criteria, including maximum internal consistency for each 
factor, may be used to extract this structure from the observed scores.
Although factor analysis may properly be classified with other methods used
to derive weights, a full consideration of factor-analytic methods and
empirical studies is beyond the scope of this review. (However, see Lord &
Novick, 1968, ch. 24.)


Condition (b) reflects the fact that manipulating the reliability of a ...
(see Cronbach, 1970; Cronbach, Gleser, Nanda, & Rajaratnam, 1970; and ...).

Not all investigators would agree that all three of the above conditions
need be met for weighting ...

Thomson (1940) suggested that, ... able, one might wish to increase both
validity ... that they should be equal and by then solving ... maximizes both
under this restriction.

... is removed. It is incorrect, however, if condition (c) is not met and
if reliability is increased solely by differential ...

Weighting by $1/\sigma_i$. The authors of tests frequently wish to eliminate
the influence of unequal standard deviations on the effective weights of
the several variables which form a composite. Weighting each measure
by the reciprocal of its standard deviation accomplishes this. Using
standard scores has the same effect; it also subtracts the mean $\mu_i$ from
each measure. Some testers mistakenly believe that this ensures equal
weighting, which is not true, of course, unless the variables are uncor- 
related or equally correlated. If they are not, the intercorrelations deter- 
mine the effective weights. Terwilliger and Anderson (1969) found that 
with more than five variables in the composite, and with a moderate 
degree of intercorrelation among the variables, the effects of standardiza- 
tion were negligible. 
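
The point about intercorrelations can be seen in a small computation. The sketch below takes a variable's share of composite variance (its weighted covariance with the composite, divided by the composite variance) as a working definition of effective weight; that definition, the function name, and the values are assumptions made for illustration.

```python
def variance_contributions(w, sd, R):
    """Proportion of composite variance attributable to each variable."""
    n = len(w)
    cov_with_composite = [
        sum(w[i] * w[j] * sd[i] * sd[j] * R[i][j] for j in range(n))
        for i in range(n)
    ]
    total = sum(cov_with_composite)  # equals the composite variance
    return [c / total for c in cov_with_composite]


sd = [2.0, 5.0, 10.0]
R = [[1.0, 0.6, 0.6], [0.6, 1.0, 0.1], [0.6, 0.1, 1.0]]

# Unit nominal weights: the variable with the largest spread dominates.
raw = variance_contributions([1.0, 1.0, 1.0], sd, R)

# Weights of 1/sd remove the spread, but the intercorrelations remain.
standardized = variance_contributions([1.0 / s for s in sd], sd, R)
```

After weighting by the reciprocal of the standard deviation, the contributions are proportional to the row sums of the correlation matrix (2.2, 1.7, and 1.7 here), so the first variable still carries more than a third of the composite variance.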


If no particular significance is attached to the inequality of the several 
variances, then removing this source of unintended weighting may be ap- 
propriate. There is at least one instance, however, in which this would 
not be true. Richardson (1941) presented an example similar to the 
following. Suppose $X_1$ is the number of items answered correctly on a
50-item test, and $X_2$ is the number answered correctly on a 100-item test
in the same subject. $X_2$ will undoubtedly have a larger standard deviation
than $X_1$. But it is also true that the longer test will in general be a
more reliable test, and hence a more valid one. If the scores are merely 
summed, the longer test will automatically have the larger effective weight. 
This works in favor of both the reliability and validity of the composite. 
Weighting by the reciprocal of the standard deviation would deny that 
any such difference between the tests existed and would thus work against 
the reliability and validity of the composite, although the longer test 
would still be weighted more heavily than the shorter one. 


Weighting by length. The above example raises the question as to
whether or not tests should be weighted in terms of their length. The 
idea of weighting by length can probably be traced to the practice of 
expressing examination grades as the percentage of items answered correct- 
ly. An unweighted combination of such percentages gives equal nominal 
weighting to each test. However, if one test consists of 50 items and a 
second of 100 items in the same subject area, then the second test is 
literally equal to two of the first. Weighting each percentage in terms of 
the length of the test on which it was computed has the effect of convert- 
ing the percentages back to a score equal to the number of items answered 
correctly, thus giving each item equal nominal weighting. 
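
The arithmetic can be checked with the two tests of the example. The scores below are hypothetical and the function name is introduced only for illustration.

```python
def combine_by_length(percent_correct, lengths):
    """Weight each percentage score by test length; returns items correct."""
    return sum(p / 100.0 * k for p, k in zip(percent_correct, lengths))


percents = [80.0, 65.0]   # hypothetical percentage scores
lengths = [50, 100]       # number of items on each test

items_correct = combine_by_length(percents, lengths)   # 40 + 65 items
length_weighted_percent = 100.0 * items_correct / sum(lengths)
unweighted_percent = sum(percents) / len(percents)
```

The unweighted mean of the two percentages is 72.5, whereas the length-weighted result is 105 items out of 150, or 70 percent, because the 100-item test counts twice as heavily as the 50-item test.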

If the tests are in different subject areas, it is not so clear that each 
item should be equally weighted, since the significance of a single item 
may differ markedly from one subject area to the next. A single lengthy 
algebra problem cannot be considered equivalent to a single vocabulary 
item. The item is a meaningful unit only when items measure the same
thing or very similar things. Likewise, even when items do supposedly
measure the same thing, it might be argued that different forms of items,
e.g., true-false vs. multiple-choice, are not equivalent units. Often, then,
it may be desirable to work with percentage scores, perhaps weighting 
these on the basis of other a priori or empirical considerations. 

Weighting by difficulty. Another popular weighting method is 
weighting by difficulty. Very often such weighting is implied rather than 
explicit, as when a teacher assigns different weights to sections or items 
of a test on the basis of an intuitive feel for their difficulty or “worth” 
rather than some conviction concerning their intrinsic validity. Some- 
times, particularly with standardized tests, the weights are derived from 
an empirical estimate of item difficulty. The weight is usually equal to 
the proportion of those taking the test who fail to answer the item cor- 
rectly. 

The logic of this type of weighting is based on the conviction that 
knowing a very difficult item is evidence of considerably more ability or 
achievement than knowing a simple one. Testers have failed to realize, 
however, that this is equivalent to penalizing the student more heavily 
for missing a difficult item than for missing an easy one; this is a rather 
counter-intuitive strategy. Weighting by the proportion of examinees 
passing the item would reverse both of these effects, giving less credit for 
a difficult item and penalizing less for failing on one. As long as there 
is but a single set of weights which is monotonically related to difficulty, 
one cannot have one side of the coin without the other. 

It is interesting to note that when dichotomously-scored items which 
vary in difficulty are given equal nominal weights, a certain amount of 
natural weighting-by-difficulty occurs, although this weighting is not 
a monotonic function of difficulty. As the difficulty of an item deviates
from .5, the maximum phi coefficients for that item with the other items
of the test become smaller; ... contribution to total variance ...
tend to be less heavily weighted than items of .5 difficulty.

Rasch (1960) developed ... kinds of tests, which includes two ...
examination material. Although dif... an examinee's ... developed within
the context of classical test-score theory and his procedure is not subject
to the general criticism of weighting by difficulty which is made above.

Weighting by validity. When it is not feasible to carry out a full- 
scale multiple-regression derivation of appropriate predictor weights, very 
often the raw correlation with the criterion is used as an approximation 
of the optimal weight. If standard scores are weighted in this manner,
however, the only factor unaccounted for is the intercorrelation of the
predictors. Since in a test with fairly homogeneous items the average
intercorrelation of each item with the others may be nearly constant, the
approximation may be a very good one indeed. The approximation is worst
when the average intercorrelation varies markedly from one item to the 
next and when raw scores are used which differ markedly in variance. 

Guilford (1941) presented the following formula for weighting test
items, $X_i$:

\[
W_i = \rho_{ic}\,\frac{\sigma_c}{\sigma_i}, \tag{12}
\]

where $\rho_{ic}$ is the correlation of the ith item with the criterion, $\sigma_c$ is
the standard deviation of the criterion measures, and $\sigma_i$ is the standard
deviation of the ith item. Equation 12 is the familiar
regression weight for $X_i$ when it alone is used to predict the criterion.
It is only an approximate multiple-regression weight, however, because it 
does not incorporate the intercorrelations among the predictors. The 
equation is equally useful in providing scoring weights for response op- 
tions to items in which there is no correct response, and in which a 
weight is derived for each of the response options in turn. 
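
Equation 12 can be applied to dichotomously scored items as follows. This is a sketch under the assumption that an item scored 0 or 1 with proportion passing p has standard deviation $\sqrt{p(1-p)}$; the function names and values are hypothetical.

```python
import math


def item_sd(p_pass):
    """Standard deviation of a 0/1 item with proportion passing p_pass."""
    return math.sqrt(p_pass * (1.0 - p_pass))


def guilford_weight(r_item_criterion, sd_criterion, sd_item):
    """Equation 12: single-predictor regression weight for one item."""
    return r_item_criterion * sd_criterion / sd_item


# Two hypothetical items, each correlating .30 with a criterion whose
# standard deviation is 10, but differing in difficulty.
w_middle = guilford_weight(0.30, 10.0, item_sd(0.5))   # item passed by 50%
w_easy = guilford_weight(0.30, 10.0, item_sd(0.9))     # item passed by 90%
```

With equal item-criterion correlations the easier item receives the larger weight (10 as against 6), simply because its standard deviation is smaller.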

Clark (1928) presented an Index of Validity for evaluating the items 
of a test: 


IV =(P—D)/(1—D), [13] 


where D is the percentage of the group taking the test who fail the item 
and P is the percentage of a criterion group who fail the item. For a 
given item the criterion group is composed of the D percentage of the 
class who rank lowest in terms of total score. Although Clark seems to 
have intended his index as a measure of the discriminating power of an 
item, at least one investigator, Peatman (1930), has used it as an item 
weight. 
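
Clark's index can be computed from raw data along the following lines. The sketch works with proportions rather than percentages, forms the criterion group from as many of the lowest-scoring examinees as there are failures on the item, and breaks ties in total score by position in the list; these details, the function name, and the data are assumptions.

```python
def clark_index(item_scores, total_scores):
    """Index of Validity, IV = (P - D) / (1 - D), for one 0/1 item."""
    n = len(item_scores)
    n_fail = sum(1 for x in item_scores if x == 0)
    if n_fail == 0 or n_fail == n:
        raise ValueError("index is undefined when all pass or all fail")
    d = n_fail / n
    # Criterion group: the n_fail examinees with the lowest total scores.
    lowest = sorted(range(n), key=lambda i: total_scores[i])[:n_fail]
    p = sum(1 for i in lowest if item_scores[i] == 0) / n_fail
    return (p - d) / (1.0 - d)


# Ten hypothetical examinees listed in order of total score.
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # 1 = pass, 0 = fail
iv = clark_index(item, totals)

# An item failed by exactly the lowest scorers discriminates perfectly.
iv_perfect = clark_index([0, 0, 0, 0, 1, 1, 1, 1, 1, 1], totals)
```

Here D = .40 and three of the four lowest scorers fail the item, so P = .75 and the index is .35/.60, or about .58.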


The Effectiveness of Weighting 


When a weighted linear composite is obtained with a given set of 
weights, the effectiveness of that set of weights relative to other sets is 
determined by the factors discussed in the section on effective weights. 
However, it is possible to make certain generalizations concerning the 
limits of the effectiveness of any set of weights relative to another set, 
regardless of the method used to derive either one. 

It is well known that if the correlation of each of two variables with
a third is known, then the limits of the possible values of the correlation
between the two variables are determined (Stanley & Wang, 1969). For
example, if each of two variables correlates .90 with a third variable, then
the correlation between these two variables must lie in the range
$.62 \le \rho \le 1.00$. Therefore, if the correlation between two weighted
composites is known, and if the correlation of one of the composites with a
criterion is known, then the limits of the correlation between the second
weighted composite and the criterion are determined. Even if validity
coefficients are not available, the correlation between the two weighted
composites gives some indication of the limits of the effectiveness of either
weighting method over the other. For example, if two different sets of
weights produce composites which correlate .99, then regardless of the
correlation of either composite with the criterion, adopting the alternative
set of weights could not be expected to affect that correlation very greatly.
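
The limits follow from the requirement that a correlation matrix of three variables be consistent: given $\rho_{13}$ and $\rho_{23}$, $\rho_{12}$ must lie within $\rho_{13}\rho_{23} \pm \sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}$. The sketch below evaluates these limits; the function name and the second example are hypothetical.

```python
import math


def correlation_limits(r13, r23):
    """Lower and upper limits of r12, given r13 and r23."""
    center = r13 * r23
    half_width = math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
    return center - half_width, center + half_width


# Two variables each correlating .90 with a third.
low, high = correlation_limits(0.90, 0.90)

# Two composites correlating .99, one of which has a validity of .50:
# limits on the validity of the other composite.
v_low, v_high = correlation_limits(0.99, 0.50)
```

The first call reproduces the range of .62 to 1.00. The second gives limits of roughly .37 and .62, which shows that the interval widens considerably once the known validity is only moderate.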
A number of authors, notably Wilks (1938), Richardson (1941),
Burt (1950), and Gulliksen (1950), presented formulas for the correlation
between two weighted sums. Since Gulliksen's formula is the most general,
and since he discussed important special cases, it is closely followed
here. If n standard scores are weighted by the sets of weights $v_i$ and $w_i$,
the respective composite scores may be denoted by $X_v$ and $X_w$. If no
restrictions are imposed on the values which the weights may assume, the
correlation between $X_v$ and $X_w$ is given directly by the following formula:


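\[
\rho_{X_v X_w} = \frac{\sum_{i=1}^{n} v_i w_i + \sum_{i=1}^{n}\sum_{j=1}^{n} v_i w_j \rho_{ij}}{\sqrt{\sum_{i=1}^{n} v_i^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} v_i v_j \rho_{ij}}\;\sqrt{\sum_{i=1}^{n} w_i^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j \rho_{ij}}}, \tag{14}
\]

where $i \neq j$. In Equation 14 the weights are expressed in raw-score form.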
The sums of squares and the cross-product sums, however, may be ex- 
pressed in terms of the means, variances, and covariances of the two sets 
of weights: 


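\[
\rho_{X_v X_w} = \frac{n(1 - \bar{\rho}_{ij})\bigl(\mathrm{Cov}(v_i, w_i) + \bar{v}\bar{w}\bigr) + (n^2 - n)\,\mathrm{Cov}\bigl((v_i w_j), \rho_{ij}\bigr) + n^2 \bar{v}\bar{w}\bar{\rho}_{ij}}{\sqrt{n(1 - \bar{\rho}_{ij})(\sigma_v^2 + \bar{v}^2) + (n^2 - n)\,\mathrm{Cov}\bigl((v_i v_j), \rho_{ij}\bigr) + n^2 \bar{v}^2 \bar{\rho}_{ij}}\;\sqrt{n(1 - \bar{\rho}_{ij})(\sigma_w^2 + \bar{w}^2) + (n^2 - n)\,\mathrm{Cov}\bigl((w_i w_j), \rho_{ij}\bigr) + n^2 \bar{w}^2 \bar{\rho}_{ij}}}, \tag{15}
\]

with $i \neq j$. The correlation between the two weighted composites thus
depends on n; the mean values of the two sets of weights, $\bar{v}$ and $\bar{w}$; the
variances of the weights, $\sigma_v^2$ and $\sigma_w^2$; the average intercorrelation
of the variables to be combined, $\bar{\rho}_{ij}$; the covariance of the two sets
of weights, $\mathrm{Cov}(v_i, w_i)$; and the covariance of a product of weights with a
corresponding correlation, $\mathrm{Cov}((v_i w_j), \rho_{ij})$. To see what happens to this
expression as n increases, we may divide the numerator and denominator
by $n^2$, and eliminate all terms which have 1/n as a factor:

\[
\rho_{X_v X_w} = \frac{\mathrm{Cov}\bigl((v_i w_j), \rho_{ij}\bigr) + \bar{v}\bar{w}\bar{\rho}_{ij}}{\sqrt{\mathrm{Cov}\bigl((v_i v_j), \rho_{ij}\bigr) + \bar{v}^2 \bar{\rho}_{ij}}\;\sqrt{\mathrm{Cov}\bigl((w_i w_j), \rho_{ij}\bigr) + \bar{w}^2 \bar{\rho}_{ij}}}, \quad i \neq j. \tag{16}
\]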


This expression equals unity if the covariance terms are equal and if the 
mean v-weight equals the mean w-weight. If the covariance terms are 
nearly zero, such that they can be ignored, then the correlation approaches 
unity regardless of the mean values of the weights. The information in 
Equations 15 and 16 was summarized by Gulliksen (1950, p. 320) as follows:

1. If either or both $\bar{v}$ and $\bar{w}$ may be zero, $\rho_{X_v X_w}$ may assume any
value regardless of the value of $\bar{\rho}_{ij}$, n, or the various covariance terms.

2. If $\bar{v}$ and $\bar{w}$ are small in relation to $\sigma_v$ and $\sigma_w$, $\rho_{X_v X_w}$ depends
primarily on the four covariance terms and is relatively insensitive to
changes in the values of $\bar{\rho}_{ij}$ and n.

3. If only positive weights are used, such that $\sigma_v/\bar{v}$ and $\sigma_w/\bar{w}$ are
less than unity, the correlation between the two composites obtained by
using two different sets of weights approaches unity as (a) the correlation
between the two sets of weights is increased, (b) the average
intercorrelation of the variables is increased, and (c) the number of variables in
the composite is increased. It should be noted particularly that the last
effect holds even when the correlation between the two sets of weights is
zero, provided $\bar{\rho}_{ij}$ is greater than zero. As the standard deviation of the
weights is increased in proportion to the mean weights, $\rho_{X_v X_w}$ approaches
unity regardless of the values of $\rho_{ij}$, $\bar{v}$, and $\bar{w}$.

From these deductions it is clear that there are very real limits on
the effectiveness of any linear weighting method, particularly when the
number of predictor variables is large and only positive weights are used.
Under these conditions even random sets of positive weights will produce
composites which are highly correlated. When the weights have been
derived according to some logical rationale, the correlation is likely to be
very high indeed.
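
The statement about random positive weights can be checked by evaluating Equation 14 directly. The sketch below is illustrative only: it assumes 50 standardized variables with a uniform intercorrelation of .30 and draws two independent sets of weights from a uniform distribution on the interval from 0 to 1. The function names, the seed, and these values are hypothetical.

```python
import math
import random


def composite_correlation(v, w, R):
    """Equation 14: correlation between two weighted composites."""
    n = len(v)

    def cross(a, b):
        # With ones on the diagonal of R this is sum(a_i b_i) plus the
        # double sum over i != j of a_i b_j rho_ij.
        return sum(a[i] * b[j] * R[i][j] for i in range(n) for j in range(n))

    return cross(v, w) / math.sqrt(cross(v, v) * cross(w, w))


def uniform_correlation_matrix(n, rho):
    """Correlation matrix with every off-diagonal entry equal to rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


rng = random.Random(0)            # fixed seed, for reproducibility only
n = 50
R = uniform_correlation_matrix(n, 0.30)
v = [rng.uniform(0.0, 1.0) for _ in range(n)]
w = [rng.uniform(0.0, 1.0) for _ in range(n)]
r_random = composite_correlation(v, w, R)
```

Under these assumptions the two composites correlate well above .95 even though the two sets of weights were drawn independently of one another, in agreement with deduction 3(c).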

Gulliksen concluded that from a practical point of view, 50 to 100 
variables are probably sufficient to make differential weighting unprofit- 
able, and the same conclusion holds if the variables are very highly cor- 
related. Weighting may be worthwhile, he contended, when there are 
few (3 to 10) variables and when their average intercorrelation is low (.50 
or less). Even then, however, the weights must have an appreciable 
standard deviation if they are to differ significantly from unit weights. 
Finally, if two sets of weights are being considered, and if the weights
are themselves highly correlated, it can make little difference which set
is used.

A word of caution is in order concerning the dismissal of the weight- 
ing question under conditions of high correlation between differently 
weighted composites. It was noted above that it is easy to determine the 
range of possible validity coefficients for one weighted composite, given 
the validity of another composite and the correlation between the com- 
posites. This range is affected equally by both correlations, and as either 
drops from unity the range increases. Thus, if the validity is only 
moderately high, even a highly correlated alternative composite may in- 
crease the validity significantly. McCornack (1956) criticized a great 
many empirical studies of the effectiveness of weighting for failure to take 
this into account. Quite often investigators report only that two com- 
posites correlate better than .90, or some higher figure, without reporting 
or even investigating whether validity is the same for both composites. 
Yet this is the question of most importance. In such studies, and they 
are numerous, the conclusion that weighting is not worthwhile, or that 
two methods of weighting will result in essentially equivalent scores, is 
not completely justified, even if it happens to be correct. 

A similar notion is that if the correlation between two composites 
is greater than the reliability of either, then it does not really matter 
which composite is used. The argument, which has been advanced by many, 
including Burt (1950), is that if two tests (composites) are more highly 
correlated with one another than either is with itself on another adminis- 
tration, then either one should be acceptable. But this still does not allow 
for the possibility that one version will have a higher validity than the
other. Although this may be unlikely, it must be recognized as a possibility.

Two factors which ... to be considered: the size ... and the form of the
distribution ...

In the section on multiple regression ... weights are derived on the ...
error will cause a ... when the weighted ... population or in a new sample.
... regression weights are to be e..., the sample on which the weights are
derived should be fairly large. In deriving weights for the ... observed
ineffectiveness of various weighting methods, in most personnel
situations and in most classrooms, large samples are simply not available.
In an interesting empirical study of the effects of sample size on predic- 
tive validity, Lawshe and Schucker (1959) compared four sets of weights 
(including unit weights), based on sample sizes of 20, 40 and 90, in three 
test batteries, each with a different average subtest intercorrelation. They 
found no differences in crossvalidated predictive validity due to any of 
these factors, although there was an effect due to sample size in one bat- 
tery with respect to multiple regression weights. They concluded, however, 
that more research on sample size was needed. 

Most weighting methods assume that the X’s have either normal or 
point distributions, or that they at least have similar distributions. This 
assumption is most important when tests of significance are performed 
or when point estimation is involved. Failure to satisfy an assumption 
of normality may have other consequences. Cliff (1960) investigated the 
effect of unlike distributions on the contribution to composite variance 
made by two tests, one of which was positively skewed, the other of which 
was negatively skewed. It was hypothesized that summing standard scores 
on the two tests would not result in equal contributions to composite vari- 
ance at various cutting points in the composite-score distribution. By 
computing the contribution of each variable to composite variance, it was 
demonstrated that the positively skewed variable contributed more to com- 
posite variance in the upper percentiles and the negatively skewed variable 
contributed more in the lower percentiles. Had the variables been sym- 
metrically distributed, they would have contributed equally throughout the 
distribution. 


Methods of Weighting Response Categories 


In the preceding sections the entity which was to be weighted, Xi, 
was a quantitative variable capable of taking at least two values. A sub- 
ject was assigned a score on each of these variables, and the scores were 
then weighted and combined to obtain a composite score. In this section 
a different weighting problem is considered. The subject's response to 
a single item of a test or inventory may be classified into one of a small 
number of mutually exclusive response categories which do not initially 
have numerical values associated with them. The response category is 
typically defined as the item option which the examinee selects or prefers. 
The weighting problem is to determine a set of weights for the response 
categories (options) in order to derive an item score for the examinee.
The problem is not very different conceptually from that of scaling the
response categories in order to assign the examinee the scale value of
the category he selects.

Occasionally, it is possible to order response options on an a priori 
basis. In attitude inventories, for example, the responses may represent
different degrees of agreement or disagreement with the attitude expressed 
by an item. Likert (1932) found that empirically scaled response weights 
were no better than a priori weights ranging from one to five for the 
options. Similarly, Gage (1957) and Yee and Kriewall (1969) found 
that test scores on the Minnesota Teacher Attitude Inventory were neither 
more nor less valid when logical weights were substituted for the recom- 
mended empirically derived weights. Yee and Kriewall noted a tendency 
for their logical scoring system to yield a more reliable test measure, 
however.


Weighting with an external qualitative criterion. Although attitude 
tests lend themselves readily to a priori keying, often, for example, with 
interest tests, it is not possible to determine logically which responses 
should be weighted most heavily. 


Consider a single item, X, to which a subject’s response may be 
classified into one of m mutually exclusive categories. The item might 
be a personal, biographical, or demographic question, or it might be an 
item of a personality, attitude, or interest test. The responses of two or 
more criterion groups to the item are available. What weight should be 
assigned to each of the m categories in order to best estimate whether the 
subject’s response is more typical of one group or the other? 


Strong (1943) presented an historical survey of methods of weighting 
responses of an interest test. His discussion is the basis of the brief summary 
which follows. The methods have in common the characteristic that the 
criterion to be predicted is qualitative, usually consisting of membership 
in a particular group. Although Strong was concerned specifically with 
the responses “dislike,” “indifferent,” and “like” (which could be ordered 
logically), his methods are appropriate whenever a number of mutually 
exclusive categories are weighted for diagnostic purposes. 


In 1924, Ream used the following rationale to weight the items of an
interest test. He had a successful group of life insurance salesmen and
an unsuccessful group respond “like,” “indifferent,” and “dislike” to
a set of items. For each response to each item he calculated the proportion
of those in the two criterion groups who selected the response. Whenever
the difference between the two proportions exceeded the standard
error of difference, the item was retained; the option was scored +1 if
the difference favored the successful group and -1 if the reverse were
true. Options which did not discriminate between the groups were
weighted zero.

A more recent example of a very similar approach to response weighting
is given in Anastasi, Meade and Schneiders (1960). Response weights
were determined according to the significance of the difference between
the proportions of those in the reference groups choosing the response
and the direction of that difference.
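
A weighting rule of the kind Ream used (+1, -1, or no weight, according to whether the difference between two group proportions exceeds its standard error) might be sketched as follows. The function name, the counts, the use of the unpooled standard-error formula, and the assignment of zero to non-discriminating options are all assumptions made for illustration; they are not taken from Ream or from Anastasi et al.

```python
import math

def ream_weight(n_chose_a, n_a, n_chose_b, n_b):
    """Return +1, -1, or 0 for one response option.
    Group A is the successful group, group B the comparison group."""
    p_a = n_chose_a / n_a
    p_b = n_chose_b / n_b
    # Standard error of the difference between two independent
    # proportions (unpooled form; an assumption of this sketch).
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    if abs(diff) <= se_diff:
        return 0  # assumed: non-discriminating options get zero weight
    return 1 if diff > 0 else -1

# Hypothetical counts for the three options of one item,
# with 100 persons in each group: (successful, comparison).
counts = {"like": (60, 40), "indifferent": (25, 27), "dislike": (15, 33)}
weights = {opt: ream_weight(a, 100, b, 100) for opt, (a, b) in counts.items()}
print(weights)
```

In this made-up item, "like" favors the successful group, "dislike" favors the comparison group, and "indifferent" does not discriminate.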

A different method was used by both Cowdery (1925) and Strong
(1930), based on a formula which had been suggested to them directly
by Truman L. Kelley: $w_j = \phi_{jc} / [(1 - \phi_{jc}^2)\,\sigma_j]$, where $w_j$ is the weight
assigned to the jth response option, $\phi_{jc}$ is the correlation between marking
the option and being in the criterion group, and $\sigma_j$ is the standard
deviation of the response distribution. Thus, the differentiating power of
each option was evaluated relative to all the other options pooled. The
$\phi_{jc}/\sigma_j$ portion of the equation is proportional to the regression weight
which would be assigned to the option if it were used to predict the
criterion. $1 - \phi_{jc}^2$ is approximately proportional to the square of the
standard error of $\phi_{jc}$ (Kelley, 1923).
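
As a rough illustration, the weight described above, the option-criterion correlation divided by the product of (1 minus its square) and the standard deviation of the option, can be computed from a fourfold table. The function name and counts are hypothetical, and the sketch assumes that the standard deviation in question is that of the dichotomous "marked this option" variable.

```python
import math

def kelley_weight(a, b, c, d):
    """Weight for one option from a 2x2 table of counts:
        a = criterion group, marked the option
        b = criterion group, did not mark it
        c = comparison group, marked the option
        d = comparison group, did not mark it
    """
    n = a + b + c + d
    # Phi coefficient between marking the option and group membership.
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # Standard deviation of the 0/1 "marked this option" variable
    # (assumed interpretation of the standard deviation in the formula).
    p = (a + c) / n
    sigma = math.sqrt(p * (1 - p))
    return phi / ((1 - phi ** 2) * sigma)

# Hypothetical counts: 60 of 100 in the criterion group and 40 of 100
# in the comparison group marked the option.
w = kelley_weight(60, 40, 40, 60)
print(round(w, 3))
```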

In 1934 Kelley revised the formula, stating that instead of being
proportional to the square of the standard error of $\phi_{jc}$, the multiplicative
constant should be proportional to the square of the error of the weight
itself, $\phi_{jc}/\sigma_j$. An appropriate working formula was derived and was
adopted by Strong for scoring the SVIB.

Guilford (1941) criticized the practice of incorporating a reliability 
factor in the weight, arguing that unreliability should provide a basis for 
determining the limit of acceptability of an item, but that it should not 
be allowed to directly influence the size of the contribution of the item 
to the total score. Kelley’s aim was to increase the reliability of the com- 
posite test, and Guilford’s criticism was similar to that advanced earlier 
against weighting for reliability.

Both Strong and Kuder (1934) contrasted a single criterion group with 
a combined group of “men in general.” Porter (1965) tried a contin- 
gency-table chi-weighting procedure for securing weights simultaneously 
from several criterion groups. He hypothesized that for similar occupa- 
tions his procedure would differentiate better than Kuder’s, whereas for 
dissimilar occupations it would not. Porter considered the Kuder Preference
Record, which is composed of items which require that three options
be ranked in terms of preference; for three options there are six possible
rankings, and each of these may be defined as a “response.” If five
occupational groups are considered, this yields a 6x5 table of frequencies. The
weight for a given response pattern on a given occupational scale was
given by chi, the signed difference between the number of responses in
the cell and the theoretical number for that cell based on the marginal
frequencies, divided by the square root of the theoretical number. The
chi-weighted responses were then summed over items to get a score for
the occupation. Although Porter’s findings were somewhat equivocal because
of an incorrect key to one of the interest scales, his results tended to
[...]

Weighting with an external quantitative criterion. In 1941, Guttman
discussed at length the weighting of response categories. He
showed that if one wishes to predict a quantitative external criterion Y 
from an item score X obtained by assigning weights to each of the response
categories of the item, the correlation ratio $\eta_{yx}^2$ will be a maximum
if each category is weighted by the mean criterion score of persons
in that response category. Such a weighting scheme produces perfect
regression of criterion scores on item scores, over examinees, where the
item score equals the weight assigned to the examinee’s response category.
The weighting scheme also maximizes $\rho_{xy}$ (Stanley & Wang, 1970).
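
A small numerical sketch may help fix the idea of weighting each response category by the mean criterion score of those who chose it. The data and the names `mean_criterion_weights` and `pearson` are invented for illustration. With mean-criterion weights, the correlation between item score and criterion equals the correlation ratio, and no other assignment of weights to categories can exceed it.

```python
from collections import defaultdict
import math

def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_criterion_weights(responses, criterion):
    """Weight each response category by the mean criterion score of
    the examinees who chose it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(responses, criterion):
        totals[cat] += y
        counts[cat] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Hypothetical data: option chosen on one item, and criterion score Y.
responses = ["a", "a", "b", "b", "b", "c", "c", "d"]
criterion = [10, 14, 20, 22, 18, 9, 11, 30]

weights = mean_criterion_weights(responses, criterion)
r_optimal = pearson([weights[r] for r in responses], criterion)

# An arbitrary alternative set of weights, for comparison.
arbitrary = {"a": 1, "b": 2, "c": 3, "d": 4}
r_arbitrary = pearson([arbitrary[r] for r in responses], criterion)
print(round(r_optimal, 3), round(r_arbitrary, 3))
```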

If a number of items are available, response weights for each may 
be determined by the above procedure, and multiple regression weights 
for the item scores may be obtained. The effectiveness of item weighting, 
of course, would be limited by all the factors discussed earlier. Guttman
made the simplifying assumption that the items were independent and 
used the unweighted mean of category weights to determine the total 
score. 

Wherry (1944) discussed a special case. In his analysis the external 
criterion was expressed as a pass-fail dichotomy, scored 1 or 0. Wherry 
showed that, for a single item, the response weights which maximize the 
point biserial p between item score and criterion are weights which are 
proportional to the proportion of passers in each response category. 

For a theoretical study of Guttman’s procedure as applied to option
weighting, see Merwin (1959) or Ramsay (1968). 

Weighting with an internal quantitative criterion. When there is 
no external criterion, and when the total score on a set of items has mean- 
ing only insofar as it permits differentiation among examinees, it is pos- 
sible to derive a set of response weights which maximizes the correlation 
between response weight and total score over subjects and over items. 
This maximization of internal consistency is achieved when the score for 
a person is the mean of the response categories he chooses, and when the 
weight for a response category is the mean score of the persons choosing 
that category (Guttman, 1941).

Since there is no a priori correct answer [...] can
satisfy this condition. Both Lawshe and Harris [...] and [...]
iterative procedures for obtaining the response weights. In the
[...] procedure, the responses were first given a priori
weights. Each subject’s score was calculated by averaging the weights
of the responses he chose. The new weight for each response
was the mean score of the persons who chose it. The
subject’s score was then revised according to the new weights, and the
iterations were continued until the weights and scores stabilized.
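
An iterative scheme of this general kind (score each person as the mean weight of the options chosen, reweight each option as the mean score of those choosing it, and repeat) can be sketched as below. This is not a reconstruction of any cited author's procedure. The function name, the data, and in particular the standardization of scores at each pass are assumptions of the sketch; some such rescaling is needed because repeated averaging alone would shrink all weights toward a single constant.

```python
import math

def reciprocal_averaging(choices, initial_weights, tol=1e-10, max_iter=1000):
    """choices[s][i] is the option subject s chose on item i.
    initial_weights maps (item, option) -> a priori weight."""
    weights = dict(initial_weights)
    n_subjects = len(choices)
    for _ in range(max_iter):
        # Score each subject as the mean weight of the options chosen.
        scores = [sum(weights[(i, opt)] for i, opt in enumerate(row)) / len(row)
                  for row in choices]
        # Assumed rescaling step: standardize the scores so that the
        # solution cannot collapse toward a constant.
        mean = sum(scores) / n_subjects
        sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n_subjects)
        scores = [(s - mean) / sd for s in scores]
        # Reweight each option as the mean score of those who chose it.
        sums, counts = {}, {}
        for row, s in zip(choices, scores):
            for key in enumerate(row):  # key is (item, option)
                sums[key] = sums.get(key, 0.0) + s
                counts[key] = counts.get(key, 0) + 1
        new_weights = {k: sums[k] / counts[k] for k in sums}
        change = max(abs(new_weights[k] - weights[k]) for k in new_weights)
        weights = new_weights
        if change < tol:
            break
    return weights, scores

# Hypothetical responses of six subjects to three two-option items.
choices = [list("xxx"), list("xxx"), list("xxy"),
           list("yyx"), list("yyy"), list("yyy")]
initial = {(i, opt): w for i in range(3) for opt, w in (("x", 1.0), ("y", 0.0))}
weights, scores = reciprocal_averaging(choices, initial)
print({k: round(v, 3) for k, v in sorted(weights.items())})
```

In this made-up example the third item is answered less consistently than the other two, and its options end up with weights closer to zero.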


The three classes of response-weighting methods presented above can
also be used to weight test-item options when there is an a priori correct
answer. If an external criterion were available, it would be possible to
maximize the validity of the individual items by differentially weighting
the distractors. Of course, it would be necessary to ensure that the correct
option has a higher weight than any other option. If no external
criterion were available, the internal consistency of the test could be
maximized by differential weighting of the distractors. A recent empirical
study of this type of weighting by Hendrickson (1970) is discussed in
a later section.


Crossvalidation and response weighting. As with multiple-regression 
techniques, it is true with optimum response-weighting techniques that 
weights derived in a particular sample must be crossvalidated before reli- 
ability and validity coefficients can be reported with confidence, since 
there is likely to be shrinkage when the original weights are applied to 
a new sample of responses (e.g., see Shick, 1957). For this reason some
investigators, such as Strong and Anastasi et al. (1960), look for statis- 
tical significance before assigning weights to response options. Small dif- 
ferences in the responses of criterion groups, while real in the sample, 
might well be due to sampling fluctuations. If Guttman’s methods were 
used in a large-scale testing program, crossvalidation of derived weights 
would be extremely important. 


Empirical Studies of Weighting 


Empirical studies of weighting far outnumber analytical ones. A 
great many early testmakers either incorporated weights into their tests 
as a matter of course (e.g., see Pintner, 1920; Wright, 1929; Yerkes,
Bridges & Hardwick, 1915) or tried out one or two methods before making 
a decision on the weighting question (e.g., see Anderson, 1925; Bovee, 
Holzinger & Morrison, 1925). Both the Yerkes-Bridges Point Scale and 
the Kuhlmann-Anderson Intelligence Test incorporated weights of some 
form, and in addition, a great many less well-known tests incorporated 
some type of weighting scheme. Because the number of studies is so large, 
and because the findings tend to agree so strongly, each of the following
sections is arbitrarily selective. In view of the factors considered above 
which limit the effectiveness of weighting, it is not surprising that the 
empirical studies tend to agree with one another. 

Although empirical studies of weighting deal mostly with the
weighting of tests, subtests, test items, and test-item options, the weighting
question has also been explored in other areas. Other kinds of information
to which weighting methods may profitably be applied include
economic, anthropometric, and psychological indices (Scates & Fauntleroy,
1938; Stromgren, 1946), biographical or personal inventories (Congdon,
1941; Wherry, 1944), and especially ratings (Jurgensen, 1955; Tiffin
& Musser, 1942).


The Weighting of Tests 


In 1931, Scates and Noffsinger reported a study of factors which in- 
fluenced the effectiveness of weighting when a number of tests in a battery 
were combined to form a composite. A 6-test battery was given to 80 
subjects and a 10-test battery was administered to 26 subjects. Four meth- 
ods of weighting were compared: (a) natural weighting of the raw scores; 
(b) a priori weighting based on the opinion of a committee of judges; (c) 
modified a priori weighting; and (d) weighting by the reciprocal of the 
standard deviation. The results were presented as the correlations between 
the various obtained composites. For the 10-test battery these correlations 
ranged from .943 to .985, and they were interpreted as evidence against 
the effectiveness of artificial weighting over natural raw-score weighting. 
A fairly high intercorrelation of the tests may explain the high correlation
of the composites. However, the differential validity of the composites
was not examined.


A later study by Wesman and Bennett (1959) illustrated the more
direct approach in which the validities themselves were compared for one
weighting method vs. a second method vs. natural weighting. The tests
were subtests of the Psychological Corporation’s College Qualification
Tests which included 70 verbal items, 50 numerical items, and 75 general-information
items. A multiple regression analysis was carried out in seven
different samples and the weights which were derived were then crossvalidated
in some of the other samples. A male and a female sample were
obtained from each of three coeducational institutions; the seventh sample
was from a women’s college. Weights were crossvalidated only in samples
of the same sex. Validity coefficients are presented in Table 1.


Table 1

Crossvalidation of Multiple-Regression Weights for a
Three-Test Battery

[Table body not recoverable. Column headings: Derivation sample; N;
Unweighted validity; Crossvalidation sample.]

Note. Based on data from Wesman and Bennett (1959).


In four out of seven instances weighting did not improve validity 
in the derivation sample; in the remaining three instances the increase 
was extremely small. Moreover, there was no systematic tendency for 
weights derived in one sample, say College A females, to increase validity 
in crossvalidation samples (College B females or College C females). The 
failure of the weights to crossvalidate well in this study may be due in 
part to the fact that different subpopulations were used rather than dif- 
ferent random samples from the same subpopulation (e.g., College A 
females). This may also explain the curious result that in some cross- 
validation samples the validity coefficient increased rather than shrinking 
as it should if a new random sample from the same population had been 
drawn. Rather, the different colleges apparently represented different 
populations, in some of which the criterion was more predictable than in 
others. 

Booth (1968) used multiple regression to weight course grades in 
arriving at a final grade in two Naval Aviation Schools: the Aviation Officer 
Candidate School (AOCS) and Flight Preparation School (FP). The 
study was undertaken to determine whether an old set of regression weights 
should be replaced by a new set in order to improve prediction of comple- 
tion vs. noncompletion of the training program. The possibility was also 
considered that subgroups might be used to derive special sets of weights, 
thereby further improving prediction. 

With a sample size large enough to provide reasonably stable regression
weights, it was found that the new weights raised the correlation
between final grade and the criterion from .207 to .268 in AOCS, and
from .304 to .313 in FP. The use of special sets of weights for different 
subgroups resulted in very slightly higher correlation, but not sufficiently 
higher to justify the difficulty involved in applying them. 


It is interesting to contrast Booth’s study with that of Wesman and
Bennett (1959) in which gains due to weighting were extremely small.
Booth compared two sets of multiple regression weights rather than multiple
regression weights vs. natural weighting, and Booth used eight course
grades as predictors rather than only three. Both of these factors should
have worked against the effectiveness of weighting. An important difference
between the studies, however, is that in Wesman and Bennett's study
the average intercorrelation between the predictors was moderately high,
whereas the course grades in AOCS intercorrelated .19 on the average and
in FP, .34. Also, the validity coefficients were initially lower in Booth’s
sample.
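
For the simplest case of two standardized predictors, the gain of optimal regression weights over unit weights can be worked out directly from the three correlations involved. The sketch below uses the standard two-predictor formulas; the function names and the correlation values are hypothetical and are not taken from Booth or from Wesman and Bennett. The figures are derivation-sample values and make no allowance for shrinkage on crossvalidation.

```python
import math

def multiple_r(r1y, r2y, r12):
    """Multiple correlation of Y with the best-weighted sum of two
    standardized predictors."""
    r_sq = (r1y ** 2 + r2y ** 2 - 2 * r1y * r2y * r12) / (1 - r12 ** 2)
    return math.sqrt(r_sq)

def unit_weight_r(r1y, r2y, r12):
    """Correlation of Y with the simple sum of the two standardized
    predictors (unit weights)."""
    return (r1y + r2y) / math.sqrt(2 + 2 * r12)

# Hypothetical validities (r1y, r2y) and predictor intercorrelations (r12).
gains = []
for r1y, r2y, r12 in [(0.50, 0.45, 0.70), (0.50, 0.20, 0.30)]:
    gain = multiple_r(r1y, r2y, r12) - unit_weight_r(r1y, r2y, r12)
    gains.append(gain)
    print(f"r1y={r1y}, r2y={r2y}, r12={r12}: gain = {gain:.3f}")
```

In the first hypothetical case, where the predictors have similar validities and are highly intercorrelated, the gain from differential weighting is negligible; in the second it is more noticeable.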
The Weighting of Test Items


Given the factors which are known to influence the effectiveness of 
weighting, there is little reason to be optimistic about the outcome of 
differentially weighting a large number of dichotomously scored, positively 
intercorrelated test items. Numerous studies have been performed demon- 
strating that item weighting is futile. Some based their conclusions on 
the high correlation of weighted composites; others compared validities 
directly. 

The arrival of the new-type or objective examination in the 1920’s 
was accompanied by claims of objectivity in scoring which would result 
in fairer assignment of course grades. Some opponents of the new tests 
felt that the claim of greater objectivity was unfounded. One such op- 
ponent, Corey (1930), attempted to demonstrate the element of subjectivity 
in new-type examinations. Corey asked six instructors to rate (weight)
each item of a 73-item test according to “its judged importance for a general
knowledge of psychology.” He scored each examination with each
set of weights and with no weights. Weighted test scores correlated from
.836 to .960 with unweighted scores. Corey established arbitrary cutoffs
and assigned letter grades to the sets of tests. He concluded that many 
students would receive very different grades depending on which instruc- 
tor’s weights were used to score the test, thus demonstrating the subjectivity 
which lingered in the new-type test. 


[...]ing, two methods of weighting by difficulty, and three random weighting
methods. With the exception of the two difficulty sets, all sets of weights
[...] Yet the composites obtained with [...] Odell also used Corey’s
proce[...]

[...] Spearman-Brown formula, and validity coefficients for the weighted and
unweighted tests are presented in Table 2.


Table 2
Reliability and Validity of Weighted
and Unweighted Tests

[Table body not recoverable. Column headings: Reliability (Weighted,
Unweighted); Validity (Weighted, Unweighted).]

Note. Based on data from Guilford, Lovell and Williams (1942).


Differences in reliability and validity for the weighted and unweighted 
tests were not significant. Guilford et al. explained that the phi coef- 
ficients for these items and the range of the weights were both small. They 
also noted that since the validity coefficient was a part-whole correlation, 
its spuriousness may have obscured real differences. However, an attempt 
to derive an estimate of the correlation of the 100-item test, without the 
spuriousness, did not support this interpretation. It was concluded that 
weighting was not worth the trouble. 

Other studies of item weighting have reached similar conclusions. 
Douglas and Spencer (1923) found weighted and unweighted tests to 
correlate .98, .99, .995, .996, .985, and .991. Holzinger (1923) reported a 
correlation of over .99 for weighted vs. unweighted items of a French 
achievement test. West (1924) found correlations ranging from .987 to 
.997 for weighted vs. unweighted comprehension tests. In addition, he 
reported correlations of .975, .956, .932, .966, .984, and .940 for six of
the Army Alpha tests, with weighted vs. unweighted scores. Peatman
(1930), using Clark’s Index of Validity to weight true-false items, found
over a series of quizzes and a final exam that correlation ranged from
.879 to .970 for the individual tests, and that the correlation for all tests
combined was .978. Ruch and Meyer (1931) found that weighting on
the basis of difficulty did not raise validity and perhaps lowered reliability.
Pothoff and Barnett (1932), in a study quite similar to that
of Odell (1931), found correlations of .965 to .987 between weighted and
unweighted scores, when weights were based on teachers’ opinions. Fruchter
(1962) found that unit weighting of the most valid items of the Kelley
Activity Preference Report provided as valid a predictor of completion of
first-term enlistment by airmen as did [...]. Finally, Stalnaker (1938),
in a study of weighting essay-type examinations, consistently found
correlations of .98 and .99 between weighted
and unweighted versions of a number of examinations of the College Entrance
Examination Board. The effectiveness of fixed item weighting
seems to have long since been disproven.


There remain at least two hopes for effective differential weighting 
of item scores. The first is Allan Birnbaum’s work on a three-parameter 
logistic model, reported in Lord and Novick (1968, pp. 395-479). Birn- 
baum’s scoring procedure applies differential weights not only to items 
but also to various ability levels. The assignment of small weights to 
items difficult for a given ability level, and larger weights to the easier 
items, tends to nullify the effects of wild guessing by the least able 
examinees. Lord (1968) used the model with Scholastic Aptitude Test 
Verbal scores and reported encouraging results, although he noted some 
possible limitations of the model. 


A second potentially effective item-weighting technique is Cleary’s 
(1966) multiple-regression model, which allows individual differences to 
emerge empirically. The model requires that each person be assigned a 
different set of regression weights, and it has effectively reduced the 
variance of errors of prediction. Unlike Birnbaum’s procedure, this
moderated linear regression approach can operate with corrected-for-chance
item scores. A criterion variable is required, however, and it remains
to be seen whether total scores on the rest of the test can be used effectively
for this.


Also to be noted is Samejima’s (1969) application of a graded-response 
model to multiple-choice situations in an attempt to estimate latent ability. 


The Weighting of Item Responses 


The list of empirical studies begins with the affirmation by Strong
[...] resulted in less differentiation be[...] a series of experiments,
Dunlap and his associates claimed to show that unit weights could be
substituted for the
larger weights with only a small loss in accuracy (Dunlap, 1940; Harper
& Dunlap, 1942; Kogan & Gehlmann, 1942; Lester & Traxler, 1942; Peter- 
son & Dunlap, 1941). In each study the strategy was to score a sample of 
blanks with both unit and regular weights, and to derive a multiple-re- 
gression equation for predicting the weighted scores from unit-weighted 
scores. The equation was then crossvalidated and the correlation between 
predicted and obtained scores was computed. This correlation was typically 
in the middle to high .90’s. 

Since the SVIB is used as the basis of vocational counseling, an im- 
portant question is, “To what extent is the letter-grade designation upon 
which counseling is based affected by the change in scoring procedure?” 
In the typical study, 3.5% of the examinees could be expected to obtain 
a grade of B under unit weighting, instead of the B+ which they would 
have obtained under the conventional scoring procedure. This critical shift 
could result in a failure on the part of the counselor to recommend the 
occupation. 

Strong (1945), in an extensive review of this research, claimed that 
not only the highest scores on the blank were to be stressed, but also the 
entire pattern, and that additional changes in the scores might noticeably 
affect this pattern. Even in 1964, Strong still maintained that unit scor- 
ing reduced validity. However, under the considerable pressure put forth 
by others, the SVIB finally acquired unit weights (Strong, Campbell, Berdie 
& Clark, 1964).

Similar findings were reported in research with the Bernreuter Personality
Inventory ([...], 1938; Kempfer, 1944; McClelland, 1944, 1947).
Here also, only a small loss of accuracy was suffered when diminished 
weights were used. 

Until fairly recently, the possibility of differentially weighting the 
incorrect options of an a-priori-right-answer item was not considered in
the literature. The closest approximation to this consisted of formula
scoring, in which a fixed proportion of the incorrect responses is subtracted
from the number of correct responses as a correction for guessing. Conventional
formula scoring is equivalent to assigning a weight of 1.00 to
the correct option, a weight of zero to the option “omit,” and a weight of
-1/(k - 1) to the incorrect options, where k is the number of options
other than “omit.” Some investigators preferred to derive the weight
for the incorrect options empirically via some technique such as regression
(e.g., see Brinkley, 1924; Dailey, 1947; Staffelbach, 1930; [...],
1919). However, formula scoring, regardless of how the formula is derived,
is not differential weighting of the incorrect options.
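
Conventional formula scoring, viewed as a set of option weights (1 for the correct option, 0 for an omit, and -1/(k - 1) for any incorrect option), reduces to a one-line computation. The function name and the examinee's counts below are hypothetical.

```python
def formula_score(n_right, n_wrong, k):
    """Conventional correction-for-guessing score: each correct answer
    counts 1, each omit 0, and each wrong answer -1/(k - 1), where k is
    the number of options per item (not counting "omit")."""
    return n_right - n_wrong / (k - 1)

# Hypothetical examinee: 40 right, 12 wrong, 8 omitted, 5-option items.
score = formula_score(40, 12, 5)
print(score)
```

Every incorrect option receives the same weight, which is why the procedure does not distinguish among distractors.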

The first step in the direction of differential weighting of incorrect
responses was taken by Nedelsky (1954). He used the opinion of judges
to identify distractors which were so grossly incorrect as to be attractive
only to an F student. Some items had no such response options; others
had more than one. From a pool of 651 students, Nedelsky selected all 
who had earned a grade of D or F on a comprehensive examination, and 
a sample of those who had earned an A, B, or C, resulting in a sample 
of 306 students. Each of these students received three scores: (a) R, the 
number of items answered correctly; (b) F, the number of F options chosen;
and (c) a composite score computed from the formula R - F/f, where
f was the average number of F options per item. Nedelsky found that for 
students scoring in the D or F range with conventional scoring, the F 
score was the most reliable score, and that throughout the distribution, 
the composite score was considerably more reliable than the conventional 
score. 
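
The three scores Nedelsky computed can be sketched as follows. The function name, the answer key, and the sets of F options are invented for illustration, and the composite is read here as R minus the quotient F/f, which is an assumption about how the formula groups.

```python
def nedelsky_scores(responses, key, f_options):
    """responses[i] is the option chosen on item i; key[i] is the correct
    option; f_options[i] is the set of options judged attractive only to
    an F student (may be empty)."""
    r = sum(1 for resp, k in zip(responses, key) if resp == k)
    f_count = sum(1 for resp, fs in zip(responses, f_options) if resp in fs)
    # f: average number of F options per item.
    f_bar = sum(len(fs) for fs in f_options) / len(f_options)
    composite = r - f_count / f_bar  # assumed reading: R - (F / f)
    return r, f_count, composite

# Hypothetical four-item test.
key = ["a", "b", "c", "d"]
f_options = [{"d"}, set(), {"a", "b"}, {"c"}]
responses = ["a", "b", "a", "c"]
result = nedelsky_scores(responses, key, f_options)
print(result)
```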

If Nedelsky’s study may be considered a significant first step in the 
direction of differential weighting of options, the later study of Davis 
and Fifer (1959) may be considered a significant second step. These 
authors noted that conventional rights-only scores, as well as formula 
scores, do not permit differentiation among examinees with respect to the 
type of distractors they select. The student who consistently chooses in- 
correct responses which are very nearly correct receives the same penalty 
as the student who chooses the same number of incorrect responses, but 
whose choices reflect very little information at all. This concern with the 
assessment of partial knowledge also underlies many of the response- 
determined scoring schemes discussed in a later section. 

From a pool of 300 arithmetic-reasoning items, two parallel tests 
of 45 items each were constructed. From the responses of a sample of 
examinees, response-option weights were derived empirically via the correlation
between marking the option and the total score on all 90 items.
(Davis, 1959, discusses the weighting scheme in detail.) Since a quantitative
criterion was available, these weights seem somewhat less appropriate
than Guttman’s (1941) criterion weights, though considerably easier
to score. These empirical weights, however, were modified in terms of
a priori ratings given to the options by two mathematicians. Two studies
were made.


[...] scores. This difference is highly significant and is equivalent to
lengthening the test from 45 to 67 items. Davis and Fifer pointed out that this
increase was not attributable to th[...]
options.
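
The statement that a gain in reliability is "equivalent to lengthening the test" rests on the Spearman-Brown formula, which can be run in either direction. In the sketch below the function names and the starting reliability of .70 are hypothetical; only the 45-to-67-item lengthening comes from the text.

```python
def spearman_brown(reliability, length_factor):
    """Reliability of a test lengthened by `length_factor` with
    parallel items."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def equivalent_length_factor(rel_old, rel_new):
    """Lengthening factor that would raise rel_old to rel_new."""
    return rel_new * (1 - rel_old) / (rel_old * (1 - rel_new))

# Hypothetical starting reliability; the lengthening from 45 to 67
# items is the figure reported in the text.
rel_old = 0.70
rel_new = spearman_brown(rel_old, 67 / 45)
items_equiv = 45 * equivalent_length_factor(rel_old, rel_new)
print(round(rel_new, 3), round(items_equiv, 1))
```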


The second study focussed on test validity. Two criteria were used:
teacher’s ratings and the score on a free-response form of the same
test. Both forms of the test were taken by 251 subjects and it was found
that validity was no different when conventional scores were replaced by
weighted scores. The authors concluded that the variance introduced into 
the total score had increased the proportion of “true” variance, but that 
the new variance had the same concurrent validity as the original. 

A slight exception may be taken to this conclusion, however, since 
the increased reliability of the test should have resulted in a concomi- 
tant increase in the validity, just as increasing the length of the test would 
have (theoretically). Before meaningful generalizations can be made 
concerning the nature of the variance which is added to the test score by 
option weighting, it is necessary to consider more carefully the composi- 
tion of total-test-score variance. Aiken (1967) presented formulas for 
the maximum total variance of the test, although some of his simplifying 
assumptions may not be completely justified. 

Jacobs and Vanderventer (1968) undertook a rather novel study in 
which the notion of facet analysis was used as the basis of a procedure for 
systematically ordering the distractors on the Coloured Progressive Matrices 
Test as to degree of correctness. This a priori method of keying the re- 
sponse alternatives was shown to have a moderate degree of test-retest 
reliability, and concurrent and predictive validity. 

Hambleton, Roberts and Traub (1970) compared two a priori meth- 
ods of obtaining option weights, one based on the procedure of Jacobs and 
Vanderventer (1968), and one based on the average rank assigned to op- 
tions by judges who ranked all options for correctness. Both sets of weights 
tended to increase estimated predictive validity of a midterm examination 
in educational measurement, and reliability was slightly increased with 
the first set of weights, although none of the increases attained statistical 
significance.

Hendrickson (1970) reported a large-scale reliability-maximizing 
study in which option weights were secured using Guttman’s (1941) tech- 
nique for maximizing internal consistency. The study was based on re- 
sponses to the Scholastic Aptitude Test of 5,000 randomly selected males 
and 5,000 randomly selected females. These groups were further divided
randomly into groups of 2,500, and option weights were obtained for each
group for each of the four subtests of the SAT separately. The weights
were derived via an iterative procedure which began by assigning to a
response category a weight equal to the mean total score on the remaining
items of the subtest obtained by the examinees who marked that
response category.
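The starting step of the iterative procedure described above can be sketched as follows. This is only an illustration: the function name, the data layout, and the use of conventional right/wrong scoring to form the initial "remaining items" score are assumptions, and the later iterations of Guttman's technique are not shown.

```python
from collections import defaultdict

def initial_option_weights(responses, key):
    """Return {(item, option): weight} for the first step only.

    responses[e][i] is the option marked by examinee e on item i;
    key[i] is the keyed (correct) option for item i. Both are
    hypothetical inputs chosen for this sketch.
    """
    n_items = len(key)
    # Assumption: the initial total score is the conventional 0/1 score.
    correct = [[int(row[i] == key[i]) for i in range(n_items)] for row in responses]
    totals = [sum(row) for row in correct]

    weights = {}
    for i in range(n_items):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for e, row in enumerate(responses):
            # Score on the remaining items, i.e., excluding item i itself.
            rest = totals[e] - correct[e][i]
            sums[row[i]] += rest
            counts[row[i]] += 1
        for option in sums:
            # Weight = mean remaining-items score of those who marked the option.
            weights[(i, option)] = sums[option] / counts[option]
    return weights


if __name__ == "__main__":
    demo = [["A", "B"], ["A", "C"], ["B", "B"]]
    print(initial_option_weights(demo, key=["A", "B"]))
```

In a full procedure these weights would replace the 0/1 item scores and the computation would be repeated until the weights stabilize.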


All results were based on doubly crossvalidated internal consistency
reliabilities. Hendrickson found that effective increases in test length
estimated from the Spearman-Brown prophecy formula ranged from a
high of 78.25% to a low of 19.09%, with an average increase of 49%.
Her results for correlations among subtests were less clear. Weighting
increased intercorrelation of the two verbal subtests, but decreased the
intercorrelations of the verbal and mathematics subtests. Hendrickson
hypothesized that the different factorial content of the two mathematics
subtests accounted for the observed drop in the correlations. Though she did
not study the effect of weighting on validity directly, the rather stable
scoring weights that she obtained should make validity studies on this
form of the SAT feasible.
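An "effective increase in test length" of this kind can be obtained by solving the Spearman-Brown prophecy formula for the lengthening factor. The sketch below shows the arithmetic; the function names and the two reliabilities in the usage lines are hypothetical and are not taken from any of the studies cited.

```python
def prophecy(r_old, n):
    """Spearman-Brown: reliability of a test lengthened by factor n."""
    return (n * r_old) / (1.0 + (n - 1.0) * r_old)

def lengthening_factor(r_old, r_new):
    """Factor n by which a test of reliability r_old would have to be
    lengthened to reach reliability r_new (prophecy formula solved for n)."""
    return (r_new * (1.0 - r_old)) / (r_old * (1.0 - r_new))


if __name__ == "__main__":
    # Hypothetical reliabilities for conventional and weighted scoring.
    n = lengthening_factor(0.80, 0.85)
    print(f"effective increase in length: {(n - 1.0) * 100:.1f}%")
```

On this reading, a gain from 45 to 67 items corresponds to a lengthening factor of about 1.49.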


Response-Determined Scoring 


All of the weighting techniques discussed so far have the character- 
istic of a constant multiplicative weight being directly associated with a 
variable (a test or a test item) or with one of a small number of re- 
sponse categories (usually a response option to a single item). Once the 
weights have been determined, the examinee’s score on a test item is
completely determined by the response option he selects. There are,
however, alternative strategies for obtaining item scores.


Since the beginning of the objective-test movement, there has been 
concern over the effects of guessing on the reliability and validity of mul- 
tiple-choice and true-false tests. Although removing the effects of guess- 
ing may not always be desirable (Fray, 1969), most testers have attempted 
to correct for guessing. Many have used formula-scoring methods which 
entail the unrealistic assumption that if a subject does not know the 
answer to a question, he guesses randomly among the response options. 
The differential popularity of incorrect options belies this assumption, how- 
ever. Moreover, Sax and Collet (1968) demonstrated that instructions 
not to guess can be as effective a remedy as formula scoring, and Slakter 
(1968) showed that the instructions which accompany formula-scored 
tests tend to penalize low risk takers. 

Another theoretical weakness of formula scoring is that it denies
the existence of partial information and misinformation. Differential
weighting of distractors represents one scheme for identifying partial
information and misinformation, although empirical methods for deriving
the weights do not ensure that the more heavily weighted options are
necessarily more correct. Thus, although the correctness of an option is
in some instances defined operationally, it is the response options
themselves which bear the burden of differentiating among the examinees
with respect to partial knowledge.


A different approach to item scoring encompasses both of these goals.
In each of the studies discussed in this section, the examinee’s response
to an item consists of more than simply selecting the correct option.
Indeed, the concept of a “response” has been considerably broadened, and it
is the characteristics of the response, rather than those of the item or the
option, which determine the item score.




Dressel and Schmid (1953) instructed subjects to cross out options 
of five-option multiple-choice items until they were certain that the correct 
response had been crossed out. Each incorrect mark was scored as -1/4
point. This combination of response method and scoring system yielded a 
reliability of .67 as compared with a reliability of .70 obtained with the 
conventional scoring procedure. 

Coombs, Milholland and Womer (1956) performed the complemen- 
tary experiment in which subjects were instructed to eliminate the incor- 
rect alternatives, taking care not to mark the correct alternative. If r 
out of k alternatives were eliminated, the score was + r if the correct 
response was not eliminated and r - k if it was. There was evidence to 
indicate that this method of scoring and responding resulted in a gain of 
reliability equivalent to lengthening the test by 20%. 
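The Coombs et al. scoring rule as stated above can be written out directly. The sketch uses the paragraph's own symbols (r alternatives eliminated out of k); the function and argument names are illustrative only.

```python
def coombs_item_score(eliminated, correct_option, k):
    """Item score under the elimination rule described above.

    eliminated     -- set of options the examinee crossed out
    correct_option -- the keyed option
    k              -- number of alternatives in the item
    """
    r = len(eliminated)
    if correct_option in eliminated:
        # The correct response was eliminated: score is r - k.
        return r - k
    # Only incorrect alternatives were eliminated: score is +r.
    return r


if __name__ == "__main__":
    print(coombs_item_score({"B", "C", "D"}, "A", 4))  # full knowledge
    print(coombs_item_score({"A", "B"}, "A", 4))       # misinformation
```

Under this rule an examinee who eliminates nothing scores zero, so partial knowledge earns intermediate credit and misinformation is penalized.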

These elimination response techniques are reminiscent of the Troyer- 
Angell punchboard, a device invented two decades ago, on which the 
subject punched out his choices until a red dot appeared in the hole, in- 
dicating that the correct option had been chosen. Scoring was based on 
the number of punches needed to reveal the dot. Jones and Sawyer (1949) 
reported that the instant feedback characteristic of the device resulted in 
higher achievement by students who used it for an entire semester. 

Willey (1960) suggested that information about the examinee’s ability
to eliminate distractors could be incorporated into the test score by
requiring him to choose the correct alternative and to eliminate two
definitely incorrect alternatives in a five-option multiple-choice item.
However, Bernhardson (1966, 1967) showed that this three-decision procedure
raised the expected chance score from 20% to 46.7% of the maximum 
score, and that it resulted in no increase in predictive validity of the test 
score. Aiken (1968) derived the chance scores for this and other rear- 
rangement or ranking responses, in combination with various scoring 
techniques, and also presented (Aiken, 1970) a simplified scoring proce- 
dure for such responses. 

Another method of assessing partial information entails the use of
confidence weights. This strategy resembles the use of “importance”
weights assigned by employees to scales of job satisfaction (Waters, 1969).
With respect to tests, the procedure has its historical antecedents in the
confidence weighting of true-false tests by Hevner (1932) and Soderquist
(1936). More recently, Ebel (1965) found that confidence weighting
could improve true-false test reliability.

Dressel and Schmid (1953) studied confidence weighting with
multiple-choice items. They had subjects assign a confidence weight from one
to four to the option they thought was correct. The weight was scored
positive if the choice was correct and negative if incorrect. They reported
a reliability of .73 for this method as opposed to .70 for the conventional
method. Michael (1968) found that confidence weighting increased
estimated reliability from .764 to .840. Hopkins, Hakstian and Hopkins
(1970) used a very similar confidence weighting system and obtained a 
small increase in test reliability, but a decrease in validity. They con- 
cluded that the added reliable variance observed with confidence-weighted 
scores may reflect irrelevant response styles of the examinees. Ebel (1965) 
suggested that the most capable students may not be the ones most able 
to discriminate which responses should be given confidently. 


The term “confidence weighting” has also been applied to a response 
method which requires the student to assign subjective probabilities to 
each option of a multiple-choice item (Shuford & Massengill, 1967). It 
is assumed that the examinee’s state of knowledge about a multiple-choice 
item may be expressed as a distribution of subjective probabilities which 
sum to 1.00 over the options. Since such distributions are not “wired in” 
to the cognitive system, it must be assumed that the examinees are cap- 
able of converting their degrees of confidence in the various options into 
such a distribution. Since the probabilities are subjective, they must be 
interpreted as measures of relative confidence only. There is no guarantee 
that identical distributions for two examinees represent the same absolute 
degrees of confidence in the options. 


Under what has been termed “admissible probability measurement,” 
the scoring system is so devised that the examinee can maximize his ex- 
pected item score only if he reports as accurately as possible the distribu- 
tion of his subjective probabilities over the options. Shuford, Albert and 
Massengill (1966) called such scoring systems “reproducing scoring sys- 
tems” (RSS) and discussed their mathematical characteristics. Shuford 
et al. showed that for an item with more than two options, a true RSS 
must depend on both the probability assigned to the correct option and 
the probabilities assigned to the remaining options. However, they 
presented a logarithmic scoring system which depends only on the proba- 
bility assigned to the correct option and which has the property of an 
RSS over most of its range. Most RSS’s are symmetrical, i.e., the item
score does not change when the probabilities assigned to the incorrect
options are permuted. There is no differential penalty for assigning high
confidence (probability) to one incorrect option over another.

Shuford and Massengill, in a series of technical reports (Massengill
& Shuford, 1967; Shuford, 1967; Shuford & Massengill, 1967), expressed
great enthusiasm and optimism concerning the potential of admissible
probability measurement for eliminating the effects of guessing on
multiple-choice tests, although they presented little direct evidence for it
(Ebel, 1968). They demonstrated mathematically that the elimination of
guessing can theoretically provide quite substantial gains in both reliability
and validity. These gains, however, were determined with reference to a
theoretical maximum which is rarely encountered in practice. Moreover,
the correction-for-guessing formula, despite its theoretical weaknesses, does
help to eliminate effects of guessing. (See Lord, 1963.)

Only two empirical studies used reproducing scoring systems with 
probabilistic responses. Rippey (1968) used two scoring systems sug- 
gested by Shuford et al. (1966), the logarithmic system and a system 
which incorporated the probabilities assigned to the incorrect options as 
well. He found that these scoring systems produced erratic changes in the 
reliability of the test. Hambleton, Roberts and Traub (1970) found that 
the logarithmic formula resulted in reduced reliability and increased
validity, although none of the differences were statistically significant.

It should be noted carefully that the use of an RSS is as important 
a component of admissible probability measurement as is the confidence- 
weighting response itself. Given the assumption that the examinee can 
evaluate his subjective probabilities, and given a scoring system which 
depends only on the weight assigned to the correct response (e.g., see
Stokes, 1966), the examinee’s optimum strategy, in terms of maximum ex- 
pected score, is to ignore the instructions and place 100% confidence in 
the answer which has the highest probability of being correct (Cech & 
Ruchinski, 1970; Williams & Millman, 1970). The student’s willingness 
to comply with the instruction to report confidence accurately thus works 
to his disadvantage unless an RSS is used. 
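The contrast drawn in this paragraph can be checked numerically. The sketch below compares a linear score (the weight placed on the correct option) with an untruncated natural-logarithm score; the latter is a stand-in assumed for illustration and is not necessarily the exact logarithmic system of Shuford et al. The function names and the probabilities are likewise hypothetical.

```python
import math

def expected_linear(true_p, reported_q):
    """Expected item score when the score is the weight on the correct option."""
    return sum(p * q for p, q in zip(true_p, reported_q))

def expected_log(true_p, reported_q):
    """Expected item score when the score is log(weight on the correct option)."""
    total = 0.0
    for p, q in zip(true_p, reported_q):
        if p == 0.0:
            continue  # an option that cannot be correct contributes nothing
        if q == 0.0:
            return float("-inf")  # zero weight on a possibly correct option
        total += p * math.log(q)
    return total


if __name__ == "__main__":
    true_p = [0.5, 0.3, 0.2]   # examinee's subjective probabilities
    all_in = [1.0, 0.0, 0.0]   # 100% confidence in the most likely option
    for name, report in (("honest", true_p), ("all-in", all_in)):
        print(name, expected_linear(true_p, report), expected_log(true_p, report))
```

With the linear score the all-in report has the higher expectation, whereas with the logarithmic score the honest report does, which is the sense in which only the latter rewards accurate reporting.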

Although the response technique recommended by Shuford and Mas- 
sengill (1967) requires the student to assign probabilities directly to the 
options, admissible probability measurement can be achieved with other 


elimination methods of Coombs et al. (1956) and Dressel and Schmid 
(1953), and the conventional answering method under formula scoring. 
De Finetti’s approach emphasized the fact that any response method and 
any scoring method jointly determine an optimum response strategy for 
the examinee who can assess his subjective probabilities. For example, 
under conventional formula scoring, the examinee should always guess 
unless every option is equally attractive. 
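The last statement follows from a short expected-value calculation, sketched below. It assumes the usual rights-minus-wrongs/(n - 1) form of the correction, with omitted items scored zero; the function names and probabilities are illustrative.

```python
def expected_gain_from_answering(p_option, n_options):
    """Expected item score for marking an option held correct with
    subjective probability p_option, under R - W/(n - 1) scoring.
    Omitting the item scores zero."""
    return p_option - (1.0 - p_option) / (n_options - 1)

def should_answer(probs):
    """True if marking the most attractive option beats omitting."""
    return expected_gain_from_answering(max(probs), len(probs)) > 0.0


if __name__ == "__main__":
    print(should_answer([0.40, 0.30, 0.20, 0.10]))
    print(expected_gain_from_answering(0.25, 4))
```

The expected gain is positive whenever the most attractive option has probability above 1/n, which holds for every distribution except the uniform one, where answering and omitting have the same expectation.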

It is a paradox that although some response methods seem, superfi- 
cially, to be simpler than assigning probabilities directly to the options, 
the optimum response strategy may take the form of a very complicated 
rule. For example, assume that the student is to respond by crossing out
the incorrect options as in the study by Coombs et al. (1956). If r is the
number of options, and k is the number of options which have been crossed
out, the score may be determined by the formula 1/(r - k) and made
negative if the correct option is crossed out, positive otherwise. How
many options should the student cross out, given his probability
distribution? Let him rank the options such that p_1 is the largest probability
and p_h is the probability assigned to the hth option, h = 1, ..., r. The rule
for maximizing the expected score on this item may then be stated: “Cross
out alternatives until the probability p* of the r - h alternatives already
crossed out plus that of the next one p_h, when multiplied by the number
h of those still left, does not attain .5” (de Finetti, 1965, p. 98). Thus,
although the response of crossing out alternatives does not seem demanding,
the strategy which allows the subject to be consistent and to maximize
his expected score is extraordinarily complex.
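The rule can be checked against a brute-force search over the number of options crossed out. The sketch below assumes one reading of the quoted rule: continue crossing out the least probable remaining option while the probability already crossed out, plus h times the probability of the next option, stays below .5. The function names and test distributions are illustrative.

```python
def expected_score(probs, k):
    """Expected score when the k least probable options are crossed out,
    with item score +1/(r - k) or -1/(r - k) as described above."""
    ranked = sorted(probs, reverse=True)
    r = len(ranked)
    crossed = sum(ranked[r - k:])  # probability that the answer was crossed out
    return (1.0 - 2.0 * crossed) / (r - k)

def best_k_brute_force(probs):
    """Number of crossings (0 to r - 1) with the highest expected score."""
    return max(range(len(probs)), key=lambda k: expected_score(probs, k))

def best_k_by_rule(probs):
    """Number of crossings under the assumed reading of the quoted rule."""
    ranked = sorted(probs, reverse=True)
    h = len(ranked)   # options still left
    crossed = 0.0     # probability already crossed out
    while h > 1 and crossed + h * ranked[h - 1] < 0.5:
        crossed += ranked[h - 1]
        h -= 1
    return len(ranked) - h


if __name__ == "__main__":
    p = [0.60, 0.20, 0.10, 0.05, 0.05]
    print(best_k_by_rule(p), best_k_brute_force(p))
```

For the distributions tried here the rule and the exhaustive search agree, although the examinee would have to carry out the equivalent of this loop mentally on every item.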


The derivation of optimum response strategies in multiple-choice test- 
ing represents an application of mathematical decision theory which under- 
scores the decision processes inherent in such tests. The success of testing 
procedures which attempt to control the decision process will be critically 
dependent on the ability of subjects to effectively use optimum strategies. 
It is not certain that all students are equally capable of learning to use 
such strategies. Also raised is the problem of differential risk-taking
propensities of different examinees. Despite the fact that risk taking in the
long run must reduce the expected score, the score on a single test can be
altered significantly by a lucky guess. These aspects of subjective
probability measurement have been considered in some detail by Winkler
(1967a, 1967b, 1967c).


Although most reproducing scoring systems are symmetrical, this is 
not a necessary characteristic of an RSS. Thus it should be possible to 
incorporate differentially weighted distractors into an asymmetric RSS. 
It is likely, though, that the examinee’s optimum strategy would then be- 
come considerably more complex. Empirical work on the use of ad- 
missible probability measurement and differential option weighting, both 
separately and in conjunction, is undoubtedly forthcoming since so much 
theoretical interest in these proposals has been aroused. 


Summary 


When a number of measures are to be combined, it is sometimes desirable
to weight the measures differentially, either with fixed weights which
are constant over persons, or with weights which are determined by
characteristics of the person’s response. In this paper the literature on a priori
and empirical weighting of test items and test-item options is reviewed.

The best known and most widely used method of obtaining weights 
for variables such as tests or test items is multiple (linear) regression. 
Other methods allow one to derive nominal weights which equalize the ef- 
fective weights of the variables, i.e., their individual contributions to the 
variance of the composite, or which equalize the correlation of each
variable with the composite, or which maximize composite reliability. Other
weighting schemes include weighting by the reciprocal of the standard 
deviation, weighting (tests) by length or difficulty, and weighting by 
the validity coefficient of the variable. 


The effectiveness of weighting depends on the number of measures to 
be combined, their intercorrelations, and certain characteristics of the 
weights. Weighting is most effective when there are but a few relatively 
independent variables in the composite. With a large number of positively 
correlated variables (such as test items), the correlation between two 
randomly weighted composites rapidly approaches unity. 


Some weighting methods provide weights for response categories such 
as those in the items of personality, attitude, and interest tests, in which 
there are no “correct” response options. A familiar example is the method 
used by E. K. Strong, Jr., to secure option weights for the Strong Voca- 
tional Interest Blank. Other methods, such as those of Guttman (1941), 
utilize a quantitative criterion to assign option weights. 


Empirical studies of weighting, popular in the 1920’s and 1930's, 
documented what the analytical papers had predicted. Weighting of test 
items was shown repeatedly to be ineffective, or so slightly effective as 
to be impractical. In interest and personality tests, elaborate option 
weights have slowly given way to unit weights with little loss in validity. 
Differential weighting of distractors in aptitude and achievement tests has 
recently been studied with interest, however. 


If the examinee’s response is not restricted to simply choosing the 
option which is thought to be correct, then the concept of “response” may 
be more broadly defined, and item scores may be based on characteristics 
of the response rather than on characteristics of the items or the options. 
Possible response methods and scoring systems are limited only by the 
imagination of testers, although some response-scoring techniques have the 
desirable characteristic of allowing the subject to maximize his expected 
score only if he reveals as accurately as possible his actual state of
knowledge. The workability of some of the suggested methods is critically
dependent on the ability of students to utilize complicated response
strategies.
day. High frequencies of question use by teachers were also found in recent
investigations: 10 primary-grade teachers asked an average of 348 questions
each during a school day (Floyd, 1960);

an average of 180 questions each in a science lesson (Moyer, 1965); and 14
fifth-grade teachers asked an average of 68 each in a 30-minute
social studies lesson (Schreiber, 1967). students are exposed
to many questions in their textbooks and on examinations.

Granting the in teaching, still do 


remain unrealized. The purpose of this 

research knowledge in this area and to suggest some contributions
to be made by researchers who are interested in improving the quality of
classroom teaching. Although textbook and examination questions
undoubtedly make a contribution to the learning process, I will limit my
review for the most part to studies of spoken questions which occur during
regular classroom teaching, particularly classroom discussions.


The author wishes to thank Dr. Walter R. Borg for his helpful suggestions and
criticism during the writing of this paper. 
The Classification of Questions by Type


Many researchers have attempted to describe the types of question 
asked by teachers. To quantify their descriptions, some have found it help- 
ful to develop sets of categories into which teachers’ questions can be class- 
ified. At least 11 classification systems have been proposed in recent years 
(Adams, 1964; Aschner, 1961; Bloom, 1956; Carner, 1963; Clements, 1964; 
Gallagher, 1965; Guszak, 1967; Moyer, 1965; Pate & Bremer, 1967; Sanders, 
1966; Schreiber, 1967). 

Several systems, such as Bloom’s, Gallagher’s, and Carner’s, consist of 
a limited number of general categories which can be used to classify ques- 
tions irrespective of context. This feature enables the researcher to investigate 
issues such as the different types of question emphasized in various school 
curricula (Pfeiffer & Davis, 1965) or in traditional or new curricula (Sloan 
& Pate, 1966). However, these systems are of limited utility if the
researcher is interested in more detailed descriptions of questions asked in a
specific context.


For detailed descriptions a classification system developed for a specific 
curriculum is preferable. One such system (Clements, 1964) was designed to 
classify the questions asked by art teachers as they talked with students about 
their artwork. For example, the “suggestion-order” category includes ques- 
tions such as: “Why don’t you make the hands larger?”; “Why not put 
some red over here?”; “Why don’t you use freer lines?” This type of ques- 
tion, which occurs frequently in art classes, is not adequately described by 
any of the categories in the more general systems. 


Guszak’s Reading-Comprehension Question-Response Inventory is a
specific classification system designed for the analysis of questions that
teachers ask elementary school reading groups. The specificity of the categories
is typified by the “recognition question” category, which includes questions
requiring students to locate information from the reading context (e.g.,
“Find what Little Red Ridinghood says to the wolf.”). In Schreiber’s system
for classifying social science questions, there are also a number of fairly
curriculum-specific categories, such as Use of Globes (e.g., “Will you find
Greenland on the globe?”) and Stating of Moral Judgments (e.g., “Do you
think it is right to have censorship of the news?”).

Most of the question-classificati 
of categories based on the type of cogniti 
systems are shown in Table 1. I have organized the categories to show
similarities between the systems. It appears that Bloom’s Taxonomy best 
represents the commonalities that exist among the systems. 


A weakness of the cognitive-process approach to question classification 
is that these processes are inferential constructs. Therefore, they cannot be 
observed directly. Bloom (1956) acknowledged this difficulty in his state- 
ment that it is not always possible to know whether a student answered 
a particular question by using a high-level cognitive process, such as analysis 
or synthesis, or by using the relatively low-level process of knowledge recall. 
The question, “What are some similarities between the Greek and American 
forms of democracy?” probably stimulates critical thinking in some students. 
However, this question may only elicit rote recall if students answer by 
recalling similarities they have read in a textbook. 

To deal with this problem, the researcher can control the lesson material 
on which the teacher bases his questions. For example, he might have a 
sample of teachers give the same reading assignment to their students. 
Preferably the assignment would be on a subject new to the students. The
teachers would then ask discussion questions on this assignment and the 
questions could be classified as recall or higher-cognitive depending on 
whether the answer was given directly in the assignment. Furthermore, if 
the researcher is studying differences between teachers in question-asking 
skill or is studying improvement in this skill as a result of a training program, 
the use of a constant lesson topic makes it possible to attribute variance in 
question-asking to the teachers rather than to differences in the lessons. 
With two exceptions (Gall, Dunning, Galassi & Banks, 1970; Hunkins,
1966, 1967), the studies reviewed here did not make use of this important
control technique.


question types which are treated scantily, if at all, in existing taxonomies: 
(a) questions which cue students to improve on an initially weak response to
a question (“Can you tell me a little more?”; “What do you mean by 
that?”); (b) questions which create a discussion atmosphere (“Billy, do you 

wit ’); (c) questions which stimulate students’ sense 
of curiosity and inquiry (“What would you like to know about this manu- 


Another limitation 
designed primarily to in 
ually use in the classroom, not the types of question which teachers should
use. Researchers have shown relatively little interest in identifying effective 
types of questions. There have been only a scattering of opinion articles, 
and these have emphasized the formal characteristics of a “good” question, 
e.g., clarity of phrasing, rather than the educational purposes which good 
questions serve. 

Much of what has been learned about the merits and pitfalls of des- 
criptive systems should provide guidance for identifying effective question 
types. For example, it would seem preferable to identify questions which 
are effective for a specific curriculum and classroom setting rather than 
to search for general question types. Research might be done to identify 
effective question types in mathematics tutoring, introducing concepts in 
the science curriculum, discussing controversial issues, role playing in social 
studies, etc. These specific question types, as compared to the categories of a 
general classification system such as Bloom’s Taxonomy, would have two 
advantages: they would provide a more precise and possibly clearer descrip- 
tion of what constitutes effective questioning in a particular teaching situ- 
ation; and they would be more useful than general question types in 
training teachers to improve their classroom instruction. 

Prior to defining effective types of question, the researcher needs to 
identify valued educational objectives in a specific setting. Once objectives 
are identified, the task of constructing questions which enable the student 
to reach each objective can be started. It would help in this task if groups 
of expert teachers and curriculum developers composed questions for each 
objective and then selected the most effective questions. In this type of 
research, effective question types would be defined in terms of whether or 
not they enabled the student to achieve desired educational objectives. 


Another task for the researcher is to consider whether there are effec- 
tive question sequences. Should teachers start a discussion by asking recall 
questions to test students’ knowledge of facts and then ask higher-cognitive 
questions that require manipulation of these facts? This was the approach 
taken by Taba (1964, 1966), who attempted to identify questioning strate- 
gies that stimulate students to reflect on curriculum materials on an in- 
creasingly abstract level. In Shaver’s model of Socratic teaching (1964), 
another type of question sequence was proposed: the teacher asks the student 
for a statement of his position on an issue, then asks appropriate follow-up 
questions to probe the student’s stated position. 

Further research on teachers’ “follow-up” questions is needed. Con- 
sider a typical situation which occurs in classroom discussions. The teacher 
asks a question such as, “What do you think can be done to solve the 
problem of air pollution?”; this would be classified as a higher-cognitive 
question in most question-classification systems. A student answers, “Make 
sure all cars and trucks have smog control devices.” Did the student really 
have to think to answer this question? He may have considered the problem
in depth and decided that smog control is the best solution. However, it is 
more likely that the student is repeating a solution he has heard or read 
about. To really test the student's ability to think about the problem and to 
stimulate the development of his thinking processes, the teacher should 
probably ask follow-up questions such as, “How would that solve the prob- 
lem?”; “Isn't that being done already?”; “Is that a better solution than 
converting to electric or steam-powered cars?” We know very little about 
teachers’ use of such questions in discussions. In fact, most question-class- 
ification systems do not take them into account since the systems are not 
concerned with question sequence. However, I suggest the hypothesis that 
follow-up questioning of the student's initial response has substantial impact 
on student learning in classroom teaching situations. 


Studies of Teachers’ Questioning Practices 


Educators generally agree that teachers should emphasize the develop- 
ment of students’ skill in critical thinking rather than in learning and 
recalling facts (Aschner, 1961; Carner, 1963; Hunkins, 1966). Yet research
spanning more than a half-century indicates that teachers’ questions have
emphasized facts.


Probably the first serious study of this issue was done by Stevens 
(1912). She found that, for a sample of high-school classes varying in grade 
level and subject area, two-thirds of the teachers’ questions required direct 
recall of textbook information. Two decades later, Haynes (1935) found that 
77% of teachers’ questions in sixth-grade history classes called for factual
answers; only 17% were judged to require students to think. In Corey’s 
study (1940), three judges classified all questions asked by teachers in a 
one-week period in a laboratory high school. The judges classified 71% of
the questions as factual and 29% as those which required a thoughtful
answer.


Studies conducted in the last several years indicated that teachers’ 
questioning practices are essentially unchanged. Floyd (1960) classified 
the questions of a sample of 40 “best” teachers in elementary classrooms. 
Specific facts were called for in 42% of the questions. I summed Floyd's 
percentages of questions in categories which appear to have required 
thoughtful responses from students; these accounted for about 20% of the 
questions asked. In two other studies conducted at the elementary-school 
level (Guszak, 1967; Schreiber, 1967), similar percentages of fact and 
thought questions were asked. At the high-school level, Gallagher (1965) 
and Davis and Tinsley (1967) classified the questions asked by teachers 
of gifted students and by student teachers. More than half of the questions 
asked by both groups were judged to test students’ recall of facts. 
The findings in studies on teachers’ questioning practices are fairly
consistent (though in some instances there are methodological flaws such
as failure to report inter-rater reliability in classification of questions and
lack of clarity in the definition of question categories). It is reasonable to 
conclude that in a half-century there has been no essential change in the 
types of question which teachers emphasize in the classroom. About 60% 
of teachers’ questions require students to recall facts; about 20% require 
students to think; and the remaining 20% are procedural. 
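
The two quantities at issue in these studies, the share of questions falling in each category and the agreement between judges who classify them, can be illustrated with a minimal sketch. It assumes a simple three-way coding (fact, thought, procedural) and uses Cohen's kappa as the agreement index; the studies reviewed above did not necessarily use this statistic, and the codings shown are hypothetical, not data from any of them.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of all classified questions, in percent."""
    total = len(labels)
    return {cat: 100.0 * n / total for cat, n in Counter(labels).items()}


def cohens_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two judges' classifications."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must classify the same non-empty set")
    n = len(judge_a)
    # Proportion of questions the two judges placed in the same category.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Agreement expected by chance, from each judge's category frequencies.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical codings of ten questions by two judges (illustration only).
judge_a = ["fact"] * 6 + ["thought"] * 2 + ["procedural"] * 2
judge_b = ["fact"] * 5 + ["thought"] * 3 + ["procedural"] * 2

print(category_shares(judge_a))
print(round(cohens_kappa(judge_a, judge_b), 2))
```

Reporting an index of this kind alongside the category percentages would let a reader judge how far the percentages depend on the individual judge.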


Why has the primary objective of American education, as revealed by 
an analysis of teachers’ questions, been the learning and recall of facts? 
One explanation is that although higher-cognitive objectives are valued in 
American education, teachers need to ask many fact questions to bring out 
the data which students require to answer thought questions. Even though 
this explanation has merit, it can be argued that instruction in facts is best 
accomplished by techniques (such as programmed instruction) that do not 
require teacher intervention. The teacher's time is better spent in developing 
students’ thinking and communication skills during discussions after the 
students have demonstrated an acceptable level of knowledge on a written 
test. 

Another explanation of the research findings is that although educators 
have for a long time advocated the pursuit of objectives such as critical 
thinking and problem solving, only recently were these objectives incorporated 
systematically into new curricula. The relationship between curriculum 
change and teachers’ questioning practices is illustrated in a recent 
study comparing teachers in the School Mathematics Study Group (SMSG) 
with teachers in a traditional mathematics program (Sloan & Pate, 1966). 
The researchers hypothesized that the two groups would differ in their 
patterns of questioning since the SMSG program emphasizes the objectives 
of inquiry and discovery. They found that, compared to the traditional math 
teachers, the “new math” teachers asked significantly fewer recall questions 
and significantly more comprehension and analysis questions. 

Sloan and Pate’s study suggested the interesting hypothesis that teach- 
ers’ use of fact and higher-cognitive questions is dependent on the type of 
curriculum materials available to them. This hypothesis could be easily 
tested by asking teachers to lead discussions based on different lesson topics 
assigned to students: for example, a poem, a traditional textbook chapter, 
a newspaper editorial, a film. On the basis of my own preliminary research 
findings, I hypothesize that teachers ask more higher-cognitive questions 
about primary sources, e.g., poems and newspaper editorials, than about 
secondary sources (most school textbooks). 

Still another reason why teachers have emphasized fact questions over 
a half-century, as indicated in research findings, is the lack of effective 
teacher training programs. In their study of questions in mathematics teaching, 
Sloan and Pate (1966, p. 166) observed: 


Although the School Mathematics Study Group teachers’ use of 
questions evidenced their awareness of the processes of inquiry and 
discovery, these processes had not been fully implemented, as shown 
by the fact that these teachers used so few synthesis and opinion 
questions that the pupils were denied the opportunity to develop 
inferences from available evidence. 


Therefore, Sloan and Pate advocated training teachers in effective ques- 
tioning practices so the objectives of the “new math” can be realized. The 
issue of teacher training in questioning skills is discussed later in this paper. 


Effect of Teachers’ Questions on Student Behavior 


Teachers’ questions are of little value unless they have an impact 
on student behavior. Yet very few researchers have explored the relation- 
ship between teachers’ questions and student outcomes. 


The most important work in this area to date is the research by Hun- 
kins (1967, 1968). The purpose of his research was to determine whether 
the variable of question type bears any relationship to student achievement. 
Two experimental groups of sixth-grade students worked daily for a month 
on sets of questions which were keyed to a social studies text. In one group 
the questions stressed knowledge; in the other, analysis and evaluation 
questions were stressed. Question types were defined in terms of Bloom’s 
Taxonomy. Hunkins found that the analysis-evaluation group earned a 
significantly higher score on a specially constructed post-training test than 
did students who answered questions that stressed knowledge. The perform- 
ance of the two groups was also compared on the six parts of the test which 
corresponded to the six main types of question in Bloom’s Taxonomy: 
the analysis-evaluation group of students did not differ from the comparison 
group in achievement on subtests containing knowledge, comprehension, 
analysis, and synthesis questions; they scored significantly higher on the 
subtests containing application and evaluation questions. 


Before the implications of these findings are considered, some possible 
limitations of Hunkins’s research design should be noted. First, whereas the 
daily sets of questions required students to write out their answers, the 
students responded to multiple-choice questions on the post-training test. 
Therefore, one may question whether the achievement test provided an 
adequate comparison of the effectiveness of the two experimental conditions. 
Second, it seems a distortion of Bloom’s Taxonomy to put the question types 
into a multiple-choice format since some types, such as evaluation questions, 
do not really have a “correct” answer. In other words, practice in answering 
certain types of questions may affect the quality of students’ responses 
rather than their correctness. Third, students monitored their own responses 
using answer sheets provided with the daily sets of questions. Teacher moni- 
toring of at least some of the students’ responses might have enhanced the 
differences found between the experimental conditions. 

In view of these methodological limitations, Hunkins’s findings 
should be viewed as only suggestive. It seems to be a reasonable hypothesis 
for further investigation, however, that if a group of students is exposed to 
certain types of question and if their responses are monitored to improve 
their quality (rather than correctness), then they will be able to answer 
similar types of question better than a group of students who have not had 
this exposure. 

In testing this hypothesis, the researcher is confronted with the prob- 
lem of defining qualitative differences in student responses. This is one of 
the important unsolved problems in the study of teachers’ questioning 
practices. Although much is known about higher-cognitive questions and 
their classification, little is known about what constitutes good answers to 
these questions. It seems reasonable to state, though, that responses to fact 
questions can be evaluated by the simple criterion of correctness, but 
responses to higher-cognitive questions require several criteria to measure 
their quality. On the basis of exploratory work on the problem I suggest 
these criteria as possibilities: (a) complexity of the response; (b) use of 
data to justify or defend the response; (c) plausibility of the response; (d) 
originality of the response; (e) clarity of the phrasing; and (f) the extent 
to which the response is directed at the question actually asked. It would 
seem reasonable to expect at least a moderate correlation between length 
of the response and its quality, particularly as judged by criteria (a) and 
(b). Dealing with a related problem, Corey and Fahey (1940) obtained 
a correlation of +.50 between judges’ ratings of the “mental complexity” 
of student questions and number of words in the question. 
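
As a rough illustration of how the expected relationship between response length and judged quality might be checked, the sketch below averages a judge's ratings on the six suggested criteria and correlates that mean with the number of words in each response. The 1-to-4 rating scale, the criterion labels used as keys, and all the figures are assumptions made for illustration; none are taken from the studies cited.

```python
import math

# Short labels for criteria (a) through (f) above; the labels are illustrative.
CRITERIA = ("complexity", "use_of_data", "plausibility",
            "originality", "clarity", "relevance")


def mean_rating(ratings):
    """Average one judge's ratings of a single response across all criteria."""
    return sum(ratings[c] for c in CRITERIA) / len(CRITERIA)


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: words per response and a judge's mean quality rating.
word_counts = [5, 12, 8, 20, 15, 3]
quality = [1, 3, 2, 4, 4, 1]

print(round(pearson_r(word_counts, quality), 2))
```

A coefficient obtained this way describes only the sample of responses rated; a high value would not by itself show that longer responses are better, since length may simply accompany the use of supporting data.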


Students’ Questions 


Some educators contend that our attention should be focused on ques- 
tions asked by students rather than on teachers’ questions (Carner, 1963; 
Wellington & Wellington, 1962). Certainly, it seems a worthwhile educa- 
tional objective to increase the frequency and quality of students’ questions 
in the context of classroom interaction. However, research findings con- 
sistently show that students have only a very limited opportunity to raise 
questions. 

Houston (1938) observed 11 junior-high-school classes and found that 
an average of less than one question per class period was student-initiated. 
Corey (1940) recorded all talk in six junior-high and high-school classrooms 
for a period of one week. The ratio of student questions to total 
questions varied considerably between classes: in two English classes, 
students accounted for 1% of the questions asked; seventh-grade and ninth- 
grade science students asked 17% and 11% of the questions respectively. 
At the primary grade level, Floyd (1960) found that student questions were 
3.75%, 5.14%, and 3.64% of the total number of questions asked during a 
taped class session for samples of first-, second-, and third-grade classrooms 
respectively. A low incidence of student questions was also reported for high- 
school English classes (Johns, 1968) and for social studies classes at the 
elementary-school (Dodl, 1966) and senior-high-school levels (Bellack, 
Kliebard, Hyman & Smith, Jr., 1966). 

In investigating student questions in the classroom, researchers need 
to undertake several important tasks. First, although it would be of interest 
to investigate the types of question students ask (see Gatto, 1928), the 
more important task is to identify the types of question which students 
should be encouraged to ask. For example, when introducing a new topic 
for study, teachers should probably ask students what they want to know 
about it. Finley (1921) found that elementary-school students had an aver- 
age of about five questions each to ask when presented with an unfamiliar 
animal in class. Another classroom situation in which student questions 
should probably be elicited occurs when a teacher has explained a new 
subject. Students should be queried about possible lack of understanding. In 
fact, one might offer the hypothesis that students encouraged to ask questions 
in this type of situation will learn more than a group of students 
deprived of this opportunity. 

Another key area for educational innovation is the training of students 
in question-asking skills. For example, what types of question should stu- 
dents ask themselves when they read a poem, a social studies textbook, or 
a science lesson? It seems that the shaping of student questioning skills has 
been a neglected feature of classroom learning. There has been increasing 
attention given to this problem since inquiry and discovery methods of 
teaching became prominent, but as Cronbach (1966) and others pointed 
out, research and training in these methods remain limited by the failure 
to adequately operationalize the concept. Perhaps the approach of focusing 
on specific questioning skills in various classroom situations, as I did above, 
would provide the clarity needed to operationalize the inquiry method. 


Programs to Change Teachers’ Questioning Behavior 


I have shown that the importance of questioning skills in teaching 
has been recognized by educators for more than a half-century. Yet rela- 
tively few programs have been implemented for the specific purpose of 
improving teachers’ questioning practices. This does not mean that the need 
for such programs has been ignored. More than 30 years ago, Houston 
(1938) developed an inservice education program for the purpose of changing 
teachers’ questioning practices. Among the techniques Houston used to 
effect behavioral change were group conferences, stenographic reports of 
each teacher's lessons, self-analysis, and supervisory evaluation. Examination 
of quantitative data yielded by pre- and post-training evaluations of 11 
teachers indicated that most of the teachers were able to effect substantial 
changes in specific aspects of their questioning behavior. As a group the 
teachers increased the percentage of questions relevant to the purpose of the 
lesson from 41.6% to 67.6%, the percentage of student participation from 
40.4% to 56.1%, and the percentage of questions requiring students to 
manipulate facts from 10% to 18%. There was also a reduction in a number 
of bothersome teaching habits such as repetition of one’s questions (from 
48 occurrences to none), repetition of students’ answers (from 5.5 to .6 
occurrences), answering of one’s own questions (from 3.5 to .3 occurrences), 
and interruption of student responses (from 10.3 to 1.5 occurrences). 


Recently a program was developed at the Far West Laboratory for 
Educational Research and Development (Borg, Kelley, Langer & Gall, 
1970) to help teachers achieve similar changes in their questioning behavior. 
Called a minicourse, it is a self-contained, inservice training package re- 
quiring about 15 hours to complete. The minicourse relies on techniques 
such as modeling, self-feedback, and microteaching (Allen & Ryan, 1969) 
to effect behavioral change. In a field test with 48 elementary-school 
teachers, the minicourse produced many highly significant changes in 
teachers’ questioning behavior, as determined by comparisons of pre- and 
post-course videotapes of 20-minute classroom discussions: increase in fre- 
quency of redirection questions (questions designed to have a number of 
students respond to one student’s original question) from 26.7 to 40.9; 
increase in percentage of thought questions from 37.3% to 52.0%; and 
increase in frequency of probing questions (questions which require students 
to improve or elaborate on their original response) from 8.3 to 13.9. As in 
Houston’s program (1938), there was also a reduction in frequency of poor 
questioning habits: repetition of one’s questions (from 13.7 to 4.7 occur- 
rences); repetition of students’ answers (from 30.7 to 4.4 occurrences); 
and answering of one’s own questions (from 4.6 to .7 occurrences). The 
Far West Laboratory now supports the development of about 20 additional 
minicourses to deal with other types of classroom teaching such as tutoring, 
role-playing, lecturing, and the inquiry method. Many of these courses 
include training in questioning skills that are appropriate to the particular 
teaching-learning context. 

Other programs for improving teachers’ questioning practices have been 
developed, though these have generally had more limited objectives than 
the programs of Houston (1938) and Borg (1970). Shaver and Oliver 
(1964) trained teachers in the use of questioning methods appropriate to 
discussion of controversial issues in the social studies. Suchman (1958) 
identified inquiry skills for science classes; training teachers in their use 
resulted in a significant increase in the number of questions asked by 
students. In social studies, Taba (1966) and her co-workers (1964) devel- 
oped a system of teacher training centered around questioning strategies. 
These questioning strategies were viewed as techniques which teachers 
could use to develop their students’ abilities in forming concepts, explaining 
cause-and-effect relationships, and exploring implications. 


Discussion 


This survey of research on questions over a fifty-year period reveals 
that the main trend has been the development of techniques to describe 
questions used by teachers in classroom practice. There is now considerable 
data regarding the incidence of teachers’ questions and the relative frequen- 
cies with which various types of questions are asked. I expect that research- 
ers will now turn their attention more toward the improvement of teachers’ 
questioning practices. 

Efforts to improve existing practices will probably move in several 
directions. First, whereas in the past researchers have developed taxonomies 
to describe questions which teachers ask, they need now to develop taxono- 
mies based on types of question which teachers should ask. This means 
that increasing attention must be paid to the definition of desirable educa- 
tional objectives and to the identification of questions and question sequences 
which will enable students to achieve these objectives. It was pointed out 
above that there are certain advantages to developing systems of question 
types which are curriculum- and situation-specific. The chief advantage 
is that teacher training in questioning methods is likely to be facilitated if 
specific rather than general types of question are learned. 


It is important that teachers’ questions should not be viewed as an end 
in themselves. They are a means to an end: producing desired changes in 
student behavior. Therefore, researchers should give high priority to the 
tasks of identifying what these desired changes are and of determining 
whether new questioning strategies have the impact on student behavior 
which is claimed for them. Hunkins’s investigation (1967, 1968), discussed 
above, may serve as the prototype for future research in this area. In line 
with the concern with student behavior, researchers should develop more 
programs directed at the shaping of student skills in questioning. 


I would like to stress again the need for effective teacher training 
programs to implement desired questioning strategies in the classrooms. 
Sloan and Pate (1966), for example, called for strong inservice training 
programs in the questioning skills necessary for teaching the “new math- 
ematics” (SMSG) curriculum. If these programs are to succeed, they need 
to incorporate two important features. First, teacher training should involve 
not only study of questioning strategies, but also guided practice in their 
use. As the findings of Borg and his colleagues (1970) seem to indicate, 
microteaching is an effective technique for providing this practice. Second, 
teachers cannot be expected to learn the inquiry method or any new 
pedagogy if it is presented to them in vague, general, undefined terms; they 
can be expected to learn new methods if the methods are presented, at 
least in part, as sets of specific types of questions asked in specific classroom 
situations. 

In the last analysis, the value of focusing on teachers’ questions is 
that they are the basic unit underlying most methods of classroom teaching. 
If this is true, then their continued study deserves the strong support of 
researchers. 